Abstract
Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, when it comes to the fashion domain, existing V+L methods are inadequate as they overlook the unique characteristics of both fashion V+L data and downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L datum contains only a single image-text pair, a fashion item is often associated with multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image combined with the text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To capitalize on this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer so that it can be flexibly adapted to any downstream task. Extensive experiments show that our FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
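To make the Multi-View Contrastive Learning objective described above concrete, here is a minimal PyTorch-style sketch of a symmetric InfoNCE loss between the visual embedding of one product view and the composed image+text embedding of another view. The function name, embedding shapes, and temperature are illustrative assumptions, not the authors' released implementation; the paper's encoders and fusion module are not reproduced here.

```python
# Minimal sketch of a multi-view InfoNCE-style contrastive loss
# (hypothetical names/shapes; not the paper's released code).
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(img_emb, fused_emb, temperature=0.07):
    """img_emb:   (B, D) visual embedding of image view 1
       fused_emb: (B, D) compositional embedding of image view 2 + text
       Matched pairs lie on the diagonal; all other rows act as in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    fused_emb = F.normalize(fused_emb, dim=-1)
    logits = img_emb @ fused_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: image-to-fused and fused-to-image directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```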
Notes
1. A single-stream model can also be applied, at the cost of traversing every query-gallery pair, which results in unacceptable retrieval speed in large-scale applications.
2. We randomly drop words in \(\textbf{w}\) and patches in \(\textbf{d}\) with a probability of 15% to make the learning process more robust.
3. Following UNITER, we use conditional masking for MLM/MPFC, i.e., at each step only one modality is masked while the other is kept intact.
4. Following BERT and UNITER, we decompose this 15% into 10% random words, 10% unchanged, and 80% [MASK] (a minimal masking sketch is given after these notes).
5. Of the 101 images, 1 is positively paired with the text and the other 100 are randomly paired but share the same sub-category as the positive, which increases the difficulty.
6. Because the authors did not release their 1K retrieval set, we report the average recall over 5 experiments with 5 randomly selected 1K retrieval sets.
7. Details for the reproduction of previous methods are in the supplementary file.
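As referenced in Notes 3-4, below is a minimal PyTorch-style sketch of the BERT-style masking decomposition (80% [MASK], 10% random words, 10% unchanged, over 15% of tokens), applied to the text side only, as in conditional masking. The token ids, vocabulary size, and [MASK] id are placeholders; this is an illustration under stated assumptions, not the authors' released code.

```python
# Minimal sketch of BERT-style conditional masking for MLM (Notes 3-4).
# Only the text modality is corrupted; the image side would be left intact.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # Choose 15% of positions as prediction targets
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                       # ignore unmasked positions in the loss

    # Of the selected positions: 80% -> [MASK]
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id

    # 10% -> a random token (half of the remaining 20%)
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    # The remaining 10% are left unchanged
    return input_ids, labels
```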
References
Akbari, H., et al.: Vatt: transformers for multimodal self-supervised learning from raw video, audio and text. In: NeurIPS (2021)
Antol, S., et al.: VQA: Visual question answering. In: ICCV (2015)
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bao, H., Dong, L., Piao, S., Wei, F.: Beit: bert pre-training of image transformers. In: ICLR (2022)
Bird, S., Klein, E., Loper, E.: Natural language processing with python: analyzing text with the natural language toolkit (2009). https://www.nltk.org
Bugliarello, E., Cotterell, R., Okazaki, N., Elliott, D.: Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language berts. TACL (2021)
Chen, Y., Gong, S., Bazzani, L.: Image search with text feedback by visiolinguistic attention learning. In: CVPR (2020)
Chen, Y.C., et al.: Uniter: universal image-text representation learning. In: ECCV (2020)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
Dong, X., et al.: M5product: a multi-modal pretraining benchmark for e-commercial product downstream tasks. arXiv preprint arXiv:2109.04275 (2021)
Dong, X., et al.: Peco: perceptual codebook for bert pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)
Dosovitskiy, A., et al.: An image is worth 16 x 16 words: transformers for image recognition at scale. In: ICLR (2020)
Dou, Z.Y., et al.: An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387 (2021)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021)
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: Vse++: improving visual-semantic embeddings with hard negatives. In: BMVC (2018)
Fei, N., et al.: Wenlan 2.0: make AI imagine via a multimodal foundation model. arXiv preprint arXiv:2110.14378 (2021)
Gao, D., et al.: Fashionbert: text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR (2020)
Geigle, G., Pfeiffer, J., Reimers, N., Vulić, I., Gurevych, I.: Retrieve fast, rerank smart: cooperative and joint approaches for improved cross-modal retrieval. arXiv preprint arXiv:2103.11920 (2021)
Guo, X., Wu, H., Cheng, Y., Rennie, S., Tesauro, G., Feris, R.S.: Dialog-based interactive image retrieval. In: NeurIPS (2018)
Han, X., He, S., Zhang, L., Song, Y.Z., Xiang, T.: UIGR: unified interactive garment retrieval. In: CVPR workshops (2022)
Han, X., et al.: Automatic spatially-aware fashion concept discovery. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hoe, J.T., Ng, K.W., Zhang, T., Chan, C.S., Song, Y.Z., Xiang, T.: One loss for all: deep hashing with a single cosine similarity based learning objective. In: NeurIPS (2021)
Hou, Y., Vig, E., Donoser, M., Bazzani, L.: Learning attribute-driven disentangled representations for interactive fashion retrieval. In: ICCV (2021)
Hu, X., et al.: Vivo: visual vocabulary pre-training for novel object captioning. In: AAAI (2021)
Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
Kim, W., Son, B., Kim, I.: VILT: Vision-and-language transformer without convolution or region supervision. In: ICML (2021)
Lee, S., Kim, D., Han, B.: Cosmo: Content-style modulation for image retrieval with text feedback. In: CVPR (2021)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Li, L.H., You, H., Wang, Z., Zareian, A., Chang, S.F., Chang, K.W.: Unsupervised vision-and-language pre-training without parallel images and captions. In: NAACL-HLT (2021)
Li, W., et al.: Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. In: ACL-IJCNLP (2021)
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020)
Liao, L., He, X., Zhao, B., Ngo, C.W., Chua, T.S.: Interpretable multimodal retrieval for fashion products. In: ACM MM (2018)
Lin, T.Y., et al.: Microsoft coco: common objects in context. In: ECCV (2014)
Lin, Y.L., Tran, S., Davis, L.S.: Fashion outfit complementary item retrieval. In: CVPR (2020)
Liu, H., Yu, T., Li, P.: Inflate and shrink: enriching and reducing interactions for fast text-image retrieval. In: EMNLP (2021)
Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: ICCV (2021)
Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)
Ma, Y., Jia, J., Zhou, S., Fu, J., Liu, Y., Tong, Z.: Towards better understanding the clothing fashion styles: a multimodal deep learning approach. In: AAAI (2017)
Van der Maaten, L., Hinton, G.: Visualizing data using T-SNE. JMLR (2008)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV (2015)
Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: Imagebert: cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
Rostamzadeh, N., et al.: Fashion-gen: the generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317 (2018)
Shin, M., Cho, Y., Ko, B., Gu, G.: Rtic: Residual learning for text and image composition using graph convolutional network. arXiv preprint arXiv:2104.03015 (2021)
Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR (2020)
Sun, S., Chen, Y.C., Li, L., Wang, S., Fang, Y., Liu, J.: Lightningdot: pre-training visual-semantic embeddings for real-time image-text retrieval. In: NAACL-HLT (2021)
Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP-IJCNLP (2019)
Tan, R., Vasileva, M.I., Saenko, K., Plummer, B.A.: Learning similarity conditions without explicit supervision. In: ICCV (2019)
Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: NeurIPS (2017)
Vasileva, M.I., Plummer, B.A., Dusad, K., Rajpal, S., Kumar, R., Forsyth, D.: Learning type-aware embeddings for fashion compatibility. In: ECCV (2018)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval - an empirical odyssey. In: CVPR (2019)
Wang, J., et al.: UFO: a unified transformer for vision-language representation learning. arXiv preprint arXiv:2111.10023 (2021)
Wang, W., Bao, H., Dong, L., Wei, F.: VLMO: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358 (2021)
Wang, Z., Wang, W., Zhu, H., Liu, M., Qin, B., Wei, F.: Distilled dual-encoder model for vision-language understanding. arXiv preprint arXiv:2112.08723 (2021)
Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: simple visual language model pretraining with weak supervision. In: ICLR (2021)
Wu, H., et al.: Fashion IQ: a new dataset towards retrieving images by natural language feedback. In: CVPR (2021)
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
Xu, H., et al.: E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In: ACL-IJCNLP (2021)
Yang, X., et al.: Fashion captioning: towards generating accurate descriptions with semantic rewards. In: ECCV (2020)
You, H., et al.: Ma-clip: towards modality-agnostic contrastive language-image pre-training. OpenReview (2021)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: ECCV (2016)
Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: CVPR (2019)
Zhang, L., et al.: Vldeformer: learning visual-semantic embeddings by vision-language transformer decomposing. arXiv preprint arXiv:2110.11338 (2021)
Zhang, P., et al.: VINVL: revisiting visual representations in vision-language models. In: CVPR (2021)
Zhang, Z., et al.: UFC-bert: unifying multi-modal controls for conditional image synthesis. In: NeurIPS (2021)
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: AAAI (2020)
Zhu, Y., et al.: Knowledge perceived multi-modal pretraining in e-commerce. In: ACM MM (2021)
Zhuge, M., et al.: Kaleido-bert: vision-language pre-training on fashion domain. In: CVPR (2021)
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Han, X., Yu, L., Zhu, X., Zhang, L., Song, YZ., Xiang, T. (2022). FashionViL: Fashion-Focused Vision-and-Language Representation Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_37
DOI: https://doi.org/10.1007/978-3-031-19833-5_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5
eBook Packages: Computer Science; Computer Science (R0)