Masked Vision-language Transformer in Fashion

Similar articles being viewed by others


Fill in the blank for fashion complementary outfit product Retrieval: VISUM summer school competition

30 December 2022

Eduardo Castro, Pedro M. Ferreira, … Sofia Beco

A single-stage fashion clothing detection using multilevel visual attention

28 December 2022

Shajini Majuran & Amirthalingam Ramanan

MMFL-net: multi-scale and multi-granularity feature learning for cross-domain fashion retrieval

22 August 2022

Chen Bao, Xudong Zhang, … Yongwei Miao

Dual attention composition network for fashion image retrieval with attribute manipulation

09 November 2022

Yongquan Wan, Guobing Zou, … Bofeng Zhang

Disentangled Representation Learning of Makeup Portraits in the Wild

11 December 2019

Yi Li, Huaibo Huang, … Tieniu Tan

Contrastive language and vision learning of general fashion concepts

08 November 2022

Patrick John Chia, Giuseppe Attanasio, … Jacopo Tagliabue

FRSFN: A semantic fusion network for practical fashion retrieval

10 May 2020

An-An Liu, Ting Zhang, … Ming Zhou

Fashion sub-categories and attributes prediction model using deep learning

07 June 2022

Muhammad Shoib Amin, Changbo Wang & Summaira Jabeen

LGVTON: a landmark guided approach for model to person virtual try-on

08 January 2022

Debapriya Roy, Sanchayan Santra & Bhabatosh Chanda

  • Research Article
  • Open Access
  • Published: 27 February 2023

Masked Vision-language Transformer in Fashion

  • Ge-Peng Ji  ORCID: orcid.org/0000-0001-7092-2877 (affiliation 1),
  • Mingchen Zhuge  ORCID: orcid.org/0000-0003-2561-7712 (affiliation 1),
  • Dehong Gao  ORCID: orcid.org/0000-0002-6636-5702 (affiliation 1),
  • Deng-Ping Fan  ORCID: orcid.org/0000-0002-5245-7518 (affiliation 2),
  • Christos Sakaridis  ORCID: orcid.org/0000-0003-1127-8887 (affiliation 2) &
  • Luc Van Gool  ORCID: orcid.org/0000-0002-3445-5711 (affiliation 2)

Machine Intelligence Research (2023)

  • 68 Accesses


Abstract

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
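To make the masked image reconstruction (MIR) objective mentioned above more concrete, the sketch below shows a generic masked-patch reconstruction loss in the spirit of MIR: a fraction of image patches is replaced by a learned mask token, a ViT-style encoder processes the full patch sequence, and only the masked patches are scored against their raw pixel values. This is a minimal illustration under assumed settings, not the authors' released implementation; the module names, mask ratio, and hyper-parameters are placeholders (see the linked repository for the actual code).

```python
# Minimal sketch of a masked-patch reconstruction objective (in the spirit of MIR).
# Module names, sizes, and the 25% mask ratio are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedPatchReconstruction(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, mask_ratio=0.25):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + project
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))           # learned [MASK] embedding
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))   # positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)               # ViT-style encoder
        self.head = nn.Linear(dim, patch * patch * 3)                    # predict raw pixels per patch
        self.mask_ratio = mask_ratio

    def forward(self, images):                                           # images: (B, 3, H, W)
        B = images.size(0)
        # Ground-truth pixels for every patch, flattened to (B, N, patch*patch*3).
        target = (images.unfold(2, self.patch, self.patch)
                        .unfold(3, self.patch, self.patch)
                        .permute(0, 2, 3, 1, 4, 5)
                        .reshape(B, self.num_patches, -1))
        tokens = self.embed(images).flatten(2).transpose(1, 2)           # (B, N, dim)
        # Randomly replace a fraction of patch tokens with the mask token.
        mask = torch.rand(B, self.num_patches, device=images.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.head(self.encoder(tokens + self.pos))                # reconstruct every patch
        return ((pred - target) ** 2)[mask].mean()                       # loss on masked patches only

# Example on a dummy batch of fashion product images.
model = MaskedPatchReconstruction()
loss = model(torch.randn(4, 3, 224, 224))
```

In a full vision-language setup, such a reconstruction loss would be optimized jointly with text-side masked-token prediction and image-text matching objectives; those components are omitted here for brevity.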


References

  1. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.

  2. Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00986.


  3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, US, pp. 6000–6010, 2017.

  4. T. X. Sun, X. Y. Liu, X. P. Qiu, X. J. Huang. Paradigm shift in natural language processing. Machine Intelligence Research, vol. 19, no. 3, pp. 169–183, 2022. DOI: https://doi.org/10.1007/s11633-022-1331-6.


  5. S. Agarwal, G. Krueger, J. Clark, A. Radford, J. W. Kim, M. Brundage. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications, [Online], Available: https://arxiv.org/abs/2108.02818, August 05, 2021.

  6. M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Article number 233, 2020.

  7. J. Y. Lin, R. Men, A. Yang, C. Zhou, Y. C. Zhang, P. Wang, J. R. Zhou, J. Tang, H. X. Yang. M6: Multi-modality-to-multi-modality multitask mega-transformer for unified pretraining. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2021. DOI: https://doi.org/10.1145/3447548.3467206.

  8. A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8821–8831, 2021.

  9. H. Wu, Y. P. Gao, X. X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 11302–11312, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01115.


  10. J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: https://doi.org/10.18653/v1/N19-1423.

  11. K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: https://doi.org/10.1109/CVPR.2016.90.


  12. S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of 2015 Annual Conference on Neural Information Processing Systems, Montreal, Canada, pp. 91–99, 2015.

  13. D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. Image-BERT: Cross-modal pre-training with large-scale weak-supervised image-text data, [Online], Available: https://arxiv.org/abs/2001.07966, January 23, 2020.

  14. J. S. Lu, D. Batra, D. Parikh, S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13–23, 2019.

  15. Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-TExt representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: https://doi.org/10.1007/978-3-030-58577-8_7.


  16. W. L. Hsiao, I. Katsman, C. Y. Wu, D. Parikh, K. Grauman. Fashion++: Minimal edits for outfit improvement. In Proceedings of IEEE/CVF International Conference On Computer Vision, IEEE, Montreal, Canada, pp. 5046–5055, 2019. DOI: https://doi.org/10.1109/ICCV.2019.00515.


  17. M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, D. Forsyth. Learning type-aware embeddings for fashion compatibility. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 405–421, 2018. DOI: https://doi.org/10.1007/978-3-030-01270-0_24.


  18. D. P. Fan, M. C. Zhuge, L. Shao. Domain Specific Pre-Training of Cross Modality Transformer Model, US20220277218, September 2022.

  19. D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 2251–2260, 2020. DOI: https://doi.org/10.1145/3397271.3401430.

  20. M. C. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12642–12652, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01246.


  21. W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548–558, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00061.


  22. X. W. Yang, H. M. Zhang, D. Jin, Y. R. Liu, C. H. Wu, J. C. Tan, D. L. Xie, J. Wang, X. Wang. Fashion captioning: Towards generating accurate descriptions with semantic rewards. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 1–17, 2020. DOI: https://doi.org/10.1007/978-3-030-58601-0_1.


  23. Z. Al-Halah, K. Grauman. From Paris to Berlin: Discovering fashion style influences around the world. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10133–10142, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.01015.


  24. H. Tan, M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 5100–5111, 2019. DOI: https://doi.org/10.18653/v1/D19-1514.

  25. W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

  26. K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212–228, 2018. DOI: https://doi.org/10.1007/978-3-030-01225-0_13.


  27. Z. X. Niu, M. Zhou, L. Wang, X. B. Gao, G. Hua. Hierarchical multimodal LSTM for dense visual-semantic embedding. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 1899–1907, 2017. DOI: https://doi.org/10.1109/ICCV.2017.208.


  28. J. Xia, M. Zhuge, T. Geng, S. Fan, Y. Wei, Z. He, F. Zheng. Skating-mixer: Multimodal MLP for scoring figure skating, [Online], Available: https://arxiv.org/abs/2203.03990, 2022.

  29. X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: https://doi.org/10.1007/978-3-030-58577-8_8.


  30. M. C. Zhuge, D. P. Fan, N. Liu, D. W. Zhang, D. Xu, L. Shao. Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: https://doi.org/10.1109/TPAMI.2022.3179526.

  31. K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 2048–2057, 2015.

  32. T. Arici, M. S. Seyfioglu, T. Neiman, Y. Xu, S. Train, T. Chilimbi, B. Zeng, I. Tutar. MLIM: Vision-and-language model pre-training with masked language and image modeling, [Online], Available: https://arxiv.org/abs/2109.12178, September 24, 2021.

  33. H. B. Bao, L. Dong, S. L. Piao, F. R. Wei. BEiT: BERT pre-training of image transformers. In Proceedings of the 10th International Conference on Learning Representations, 2022.

  34. K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01553.


  35. Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers, [Online], Available: https://arxiv.org/abs/2004.00849, June 22, 2020.

  36. X. D. Lin, G. Bertasius, J. Wang, S. F. Chang, D. Parikh, L. Torresani. VX2TEXT: End-to-end learning of video-based text generation from multimodal inputs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7001–7011, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.00693.


  37. W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.

  38. M. Yan, H. Y. Xu, C. L. Li, B. Bi, J. F. Tian, M. Gui, W. Wang. Grid-VLP: Revisiting grid features for vision-language pre-training, [Online], Available: https://arxiv.org/abs/2108.09479, August 21, 2021.

  39. Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12971–12980, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01278.


  40. S. Goenka, Z. H. Zheng, A. Jaiswal, R. Chada, Y. Wu, V. Hedau, P. Natarajan. FashionVLP: Vision language transformer for fashion retrieval with feedback. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 14085–14095, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01371.


  41. J. Lei, L. J. Li, L. W. Zhou, Z. Gan, T. L. Berg, M. Bansal, J. J. Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7327–7337, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.00725.


  42. H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 503–513, 2021.

  43. H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 24206–24221, 2021.

  44. X. Y. Yi, J. Yang, L. C. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, ACM, Copenhagen, Denmark, pp. 269–277, 2019. DOI: https://doi.org/10.1145/3298689.3346996.


  45. O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Munich, Germany, pp. 234–241, 2015. DOI: https://doi.org/10.1007/978-3-319-24574-4_28.


  46. C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 2131–2140, 2019. DOI: https://doi.org/10.18653/v1/D19-1219.

  47. N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-gen: The generative fashion dataset and challenge, [Online], Available: https://arxiv.org/abs/1806.08317v1, July 30, 2018.

  48. R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models, [Online], Available: https://arxiv.org/abs/1411.2539, 2014.

  49. F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of British Machine Vision Conference, Newcastle, UK, 2018.

  50. Y. X. Wang, H. Yang, X. M. Qian, L. Ma, J. Lu, B. Li, X. Fan. Position focused attention network for image-text matching. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 3792–3798, 2019.

  51. J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, F. F. Li. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, USA, pp. 248–255, 2009. DOI: https://doi.org/10.1109/CVPR.2009.5206848.


  52. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.

  53. T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zürich, Switzerland, pp. 740–755, 2014. DOI: https://doi.org/10.1007/978-3-319-10602-1_48.


  54. G. Li, N. Duan, Y. J. Fang, M. Gong, D. Jiang. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, USA, pp. 11336–11344, 2020.

  55. L. Wu, D. Y. Liu, X. J. Guo, R. C. Hong, L. C. Liu, R. Zhang. Multi-scale spatial representation learning via recursive hermite polynomial networks. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 1465–1473, 2022. DOI: https://doi.org/10.24963/ijcai.2022/204.

  56. D. P. Chen, M. Wang, H. B. Chen, L. Wu, J. Qin, W. Peng. Cross-modal retrieval with heterogeneous graph embedding. In Proceedings of the 30th ACM International Conference on Multimedia, ACM, Lisboa, Portugal, pp. 3291–3300, 2022. DOI: https://doi.org/10.1145/3503161.3548195.


  57. D. Y. Liu, L. Wu, F. Zheng, L. Q. Liu, M. Wang. Verbal-person nets: Pose-guided multi-granularity language-to-person generation. IEEE Transactions on Neural Networks and Learning Systems, to be published. DOI: https://doi.org/10.1109/TNNLS.2022.3151631.

  58. Z. Zhang, H. Y. Luo, L. Zhu, G. M. Lu, H. T. Shen. Modality-invariant asymmetric networks for cross-modal hashing. IEEE Transactions on Knowledge and Data Engineering, to be published. DOI: https://doi.org/10.1109/TKDE.2022.3144352.


Acknowledgements

This work is funded by Toyota Motor Europe via the research project TRACE-Zürich. The authors would also like to thank the anonymous reviewers and the editor for their helpful comments on this manuscript.

Author information

Author notes
  1. These authors contributed equally to this paper.

Authors and Affiliations

  1. International Core Business Unit, Alibaba Group, Hangzhou, 310051, China

    Ge-Peng Ji, Mingchen Zhuge & Dehong Gao

  2. Computer Vision Lab, ETH Zürich, Zürich, 8092, Switzerland

    Deng-Ping Fan, Christos Sakaridis & Luc Van Gool


Corresponding author

Correspondence to Deng-Ping Fan.

Additional information

Work was done while Ge-Peng Ji was a research intern at Alibaba Group.

Conflicts of Interest

The authors declare that they have no conflicts of interest in this work, and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Ge-Peng Ji received the M. Sc. degree in communication and information systems from Wuhan University, China in 2021. He is currently a Ph. D. degree candidate at the Australian National University, supervised by Professor Nick Barnes, majoring in engineering and computer science. He has published about 10 peer-reviewed journal and conference papers. In 2021, he received the Student Travel Award from the Medical Image Computing and Computer-assisted Intervention Society.

His research interests lie in computer vision, especially in a variety of dense prediction tasks, such as video analysis, medical image segmentation, camouflaged object segmentation, and saliency detection.

E-mail: gepengai.ji@gmail.com

ORCID iD: 0000-0001-7092-2877

Mingchen Zhuge received the M. Sc. degree in computer science from China University of Geosciences, China in 2021. He is a Ph. D. degree candidate at King Abdullah University of Science and Technology (KAUST) under the supervision of Prof. Juergen Schmidhuber. In 2019, he won the championship of the ZTE algorithm competition. He has worked as an intern at Alibaba Group and IIAI, and as a visiting scholar at SUSTech. He has also been invited to serve as a reviewer for top conferences such as CVPR, ICML, ECCV and NeurIPS.

His research interests include multi-modal learning and reinforcement learning.

E-mail: mczhuge@gmail.com

ORCID iD: 0000-0003-2561-7712

Dehong Gao received the Ph. D. degree from The Hong Kong Polytechnic University, China in 2014. He is now an associate professor at Northwestern Polytechnical University, China.

His research interests include information retrieval, recommendation, natural language processing and machine learning.

E-mail: gaodehong_polyu@163.com, dehong.gdh@alibaba-inc.com

ORCID iD: 0000-0002-6636-5702

Deng-Ping Fan received the Ph. D. degree from Nankai University, China in 2019. He joined the Inception Institute of Artificial Intelligence (IIAI), UAE in 2019. He has published about 50 papers in top journals and conferences such as TPAMI, IJCV, TIP, TNNLS, TMI, CVPR, ICCV, ECCV and IJCAI. He won the Best Paper Finalist Award at IEEE CVPR 2019 and was a Best Paper Award Nominee at IEEE CVPR 2020. He was recognized as a CVPR 2019 outstanding reviewer with a special mention award, a CVPR 2020 outstanding reviewer, an ECCV 2020 high-quality reviewer, and a CVPR 2021 outstanding reviewer. He has served as a program committee board (PCB) member of IJCAI 2022–2024, a senior program committee (SPC) member of IJCAI 2021, a program committee (PC) member of CAD&CG 2021, a committee member of the China Society of Image and Graphics (CSIG), an area chair of the NeurIPS 2021 Datasets and Benchmarks Track, and an area chair of the MICCAI 2020 Workshop.

His research interests include computer vision, deep learning, and visual attention, especially human vision on co-salient object detection, RGB salient object detection, RGB-D salient object detection, and video salient object detection.

E-mail: dengpfan@gmail.com (Corresponding author)

ORCID iD: 0000-0002-5245-7518

Christos Sakaridis received the M. Sc. degree in computer science from ETH Zürich, Switzerland in 2016 and his Diploma in electrical and computer engineering from the National Technical University of Athens, Greece in 2014, conducting his Diploma thesis at the CVSP Group under the supervision of Prof. Petros Maragos. He received the Ph. D. degree in electrical engineering and information technology from ETH Zürich, Switzerland in 2021, working at the Computer Vision Lab under the supervision of Prof. Luc Van Gool. He is a postdoctoral researcher at the Computer Vision Lab, ETH Zürich, Switzerland. Since 2021, he has been the Principal Engineer of TRACE-Zürich, a project on computer vision for autonomous cars running at the Computer Vision Lab and funded by Toyota Motor Europe. Moreover, he is the team leader of the EFCL project Sensor Fusion, in which his team develops adaptive sensor fusion architectures for high-level visual perception.

His broad research fields are computer vision and machine learning. The focus of his research is on high-level visual perception, involving adverse visual conditions, domain adaptation, semantic segmentation, depth estimation, object detection, synthetic data generation, and fusion of multiple sensors (including lidar, radar and event cameras, with emphasis on their application to autonomous cars and robots).

E-mail: csakarid@vision.ee.ethz.ch

ORCID iD: 0000-0003-1127-8887

Luc Van Gool received the Ph. D. degree in electromechanical engineering from Katholieke Universiteit Leuven, Belgium in 1981. Currently, he is a professor at Katholieke Universiteit Leuven, Belgium, and ETH Zürich, Switzerland, where he leads and teaches computer vision research. He has been a program committee member of several major computer vision conferences. He received several Best Paper awards, won a David Marr Prize and a Koenderink Award, and was nominated Distinguished Researcher by the IEEE Computer Science committee. He is a co-founder of 10 spin-off companies.

His interests include 3D reconstruction and modeling, object recognition, tracking, gesture analysis, and a combination of those.

E-mail: vangool@vision.ee.ethz.ch

ORCID iD: 0000-0002-3445-5711

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article


Cite this article

Ji, GP., Zhuge, M., Gao, D. et al. Masked Vision-language Transformer in Fashion. Mach. Intell. Res. (2023). https://doi.org/10.1007/s11633-022-1394-4

Download citation

  • Received: 24 May 2022

  • Accepted: 14 October 2022

  • Published: 27 February 2023

  • DOI: https://doi.org/10.1007/s11633-022-1394-4


Keywords

  • Vision-language
  • masked image reconstruction
  • transformer
  • fashion
  • e-commercial

