We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation learning. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design a masked image reconstruction (MIR) objective for fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
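The MIR objective described above can be sketched in a few lines: split the image into patches, hide a random subset, and score the model's reconstruction only on the hidden patches. The following NumPy sketch is illustrative of generic masked-patch reconstruction; the patch size, mask ratio, and L1 loss are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def patchify(img, p=16):
    # Split an (H, W, C) image into flattened (N, p*p*C) patches.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

def random_mask(n_patches, ratio, rng):
    # Randomly choose which patch indices are hidden from the encoder.
    n_mask = int(n_patches * ratio)
    idx = rng.permutation(n_patches)
    return idx[:n_mask], idx[n_mask:]  # (masked, visible)

def mir_loss(pred_patches, target_patches, masked_idx):
    # L1 reconstruction loss, computed on the masked patches only.
    return np.abs(pred_patches[masked_idx] - target_patches[masked_idx]).mean()

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
patches = patchify(img)                          # (196, 768) for 16x16 patches
masked, visible = random_mask(len(patches), 0.6, rng)
pred = rng.random(patches.shape)                 # stand-in for decoder output
loss = mir_loss(pred, patches, masked)
```

In the full model, `pred` would come from a transformer decoder conditioned on the visible patches and the paired text, so the loss pushes the network to use cross-modal context when filling in missing regions.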
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00986.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, US, pp. 6000–6010, 2017.
T. X. Sun, X. Y. Liu, X. P. Qiu, X. J. Huang. Paradigm shift in natural language processing. Machine Intelligence Research, vol. 19, no. 3, pp. 169–183, 2022. DOI: https://doi.org/10.1007/s11633-022-1331-6.
S. Agarwal, G. Krueger, J. Clark, A. Radford, J. W. Kim, M. Brundage. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications, [Online], Available: https://arxiv.org/abs/2108.02818, August 05, 2021.
M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Article number 233, 2020.
J. Y. Lin, R. Men, A. Yang, C. Zhou, Y. C. Zhang, P. Wang, J. R. Zhou, J. Tang, H. X. Yang. M6: Multi-modality-to-multi-modality multitask mega-transformer for unified pretraining. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2021. DOI: https://doi.org/10.1145/3447548.3467206.
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8821–8831, 2021.
H. Wu, Y. P. Gao, X. X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 11302–11312, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01115.
J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: https://doi.org/10.18653/v1/N19-1423.
K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: https://doi.org/10.1109/CVPR.2016.90.
S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of 2015 Annual Conference on Neural Information Processing Systems, Montreal, Canada, pp. 91–99, 2015.
D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. Image-BERT: Cross-modal pre-training with large-scale weak-supervised image-text data, [Online], Available: https://arxiv.org/abs/2001.07966, January 23, 2020.
J. S. Lu, D. Batra, D. Parikh, S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13–23, 2019.
Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-TExt representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: https://doi.org/10.1007/978-3-030-58577-8_7.
W. L. Hsiao, I. Katsman, C. Y. Wu, D. Parikh, K. Grauman. Fashion++: Minimal edits for outfit improvement. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 5046–5055, 2019. DOI: https://doi.org/10.1109/ICCV.2019.00515.
M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, D. Forsyth. Learning type-aware embeddings for fashion compatibility. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 405–421, 2018. DOI: https://doi.org/10.1007/978-3-030-01270-0_24.
D. P. Fan, M. C. Zhuge, L. Shao. Domain specific pre-training of cross modality transformer model, US20220277218, September 2022.
D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 2251–2260, 2020. DOI: https://doi.org/10.1145/3397271.3401430.
M. C. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12642–12652, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01246.
W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548–558, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00061.
X. W. Yang, H. M. Zhang, D. Jin, Y. R. Liu, C. H. Wu, J. C. Tan, D. L. Xie, J. Wang, X. Wang. Fashion captioning: Towards generating accurate descriptions with semantic rewards. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 1–17, 2020. DOI: https://doi.org/10.1007/978-3-030-58601-0_1.
Z. Al-Halah, K. Grauman. From Paris to Berlin: Discovering fashion style influences around the world. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10133–10142, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.01015.
H. Tan, M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 5100–5111, 2019. DOI: https://doi.org/10.18653/v1/D19-1514.
W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212–228, 2018. DOI: https://doi.org/10.1007/978-3-030-01225-0_13.
Z. X. Niu, M. Zhou, L. Wang, X. B. Gao, G. Hua. Hierarchical multimodal LSTM for dense visual-semantic embedding. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 1899–1907, 2017. DOI: https://doi.org/10.1109/ICCV.2017.208.
J. Xia, M. Zhuge, T. Geng, S. Fan, Y. Wei, Z. He, F. Zheng. Skating-mixer: Multimodal MLP for scoring figure skating, [Online], Available: https://arxiv.org/abs/2203.03990, 2022.
X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: https://doi.org/10.1007/978-3-030-58577-8_8.
M. C. Zhuge, D. P. Fan, N. Liu, D. W. Zhang, D. Xu, L. Shao. Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: https://doi.org/10.1109/TPAMI.2022.3179526.
K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 2048–2057, 2015.
T. Arici, M. S. Seyfioglu, T. Neiman, Y. Xu, S. Train, T. Chilimbi, B. Zeng, I. Tutar. MLIM: Vision-and-language model pre-training with masked language and image modeling, [Online], Available: https://arxiv.org/abs/2109.12178, September 24, 2021.
H. B. Bao, L. Dong, S. L. Piao, F. R. Wei. BEiT: BERT pre-training of image transformers. In Proceedings of the 10th International Conference on Learning Representations, 2022.
K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01553.
Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers, [Online], Available: https://arxiv.org/abs/2004.00849, June 22, 2020.
X. D. Lin, G. Bertasius, J. Wang, S. F. Chang, D. Parikh, L. Torresani. VX2TEXT: End-to-end learning of video-based text generation from multimodal inputs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7001–7011, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.00693.
W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
M. Yan, H. Y. Xu, C. L. Li, B. Bi, J. F. Tian, M. Gui, W. Wang. Grid-VLP: Revisiting grid features for vision-language pre-training, [Online], Available: https://arxiv.org/abs/2108.09479, August 21, 2021.
Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12971–12980, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.01278.
S. Goenka, Z. H. Zheng, A. Jaiswal, R. Chada, Y. Wu, V. Hedau, P. Natarajan. FashionVLP: Vision language transformer for fashion retrieval with feedback. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 14085–14095, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01371.
J. Lei, L. J. Li, L. W. Zhou, Z. Gan, T. L. Berg, M. Bansal, J. J. Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7327–7337, 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.00725.
H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 503–513, 2021.
H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 24206–24221, 2021.
X. Y. Yi, J. Yang, L. C. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, ACM, Copenhagen, Denmark, pp. 269–277, 2019. DOI: https://doi.org/10.1145/3298689.3346996.
O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Munich, Germany, pp. 234–241, 2015. DOI: https://doi.org/10.1007/978-3-319-24574-4_28.
C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 2131–2140, 2019. DOI: https://doi.org/10.18653/v1/D19-1219.
N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-gen: The generative fashion dataset and challenge, [Online], Available: https://arxiv.org/abs/1806.08317v1, July 30, 2018.
R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models, [Online], Available: https://arxiv.org/abs/1411.2539, 2014.
F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of British Machine Vision Conference, Newcastle, UK, 2018.
Y. X. Wang, H. Yang, X. M. Qian, L. Ma, J. Lu, B. Li, X. Fan. Position focused attention network for image-text matching. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 3792–3798, 2019.
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, F. F. Li. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, USA, pp. 248–255, 2009. DOI: https://doi.org/10.1109/CVPR.2009.5206848.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zürich, Switzerland, pp. 740–755, 2014. DOI: https://doi.org/10.1007/978-3-319-10602-1_48.
G. Li, N. Duan, Y. J. Fang, M. Gong, D. Jiang. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, USA, pp. 11336–11344, 2020.
L. Wu, D. Y. Liu, X. J. Guo, R. C. Hong, L. C. Liu, R. Zhang. Multi-scale spatial representation learning via recursive hermite polynomial networks. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 1465–1473, 2022. DOI: https://doi.org/10.24963/ijcai.2022/204.
D. P. Chen, M. Wang, H. B. Chen, L. Wu, J. Qin, W. Peng. Cross-modal retrieval with heterogeneous graph embedding. In Proceedings of the 30th ACM International Conference on Multimedia, ACM, Lisboa, Portugal, pp. 3291–3300, 2022. DOI: https://doi.org/10.1145/3503161.3548195.
D. Y. Liu, L. Wu, F. Zheng, L. Q. Liu, M. Wang. Verbal-person nets: Pose-guided multi-granularity language-to-person generation. IEEE Transactions on Neural Networks and Learning Systems, to be published. DOI: https://doi.org/10.1109/TNNLS.2022.3151631.
Z. Zhang, H. Y. Luo, L. Zhu, G. M. Lu, H. T. Shen. Modality-invariant asymmetric networks for cross-modal hashing. IEEE Transactions on Knowledge and Data Engineering, to be published. DOI: https://doi.org/10.1109/TKDE.2022.3144352.
This work is funded by Toyota Motor Europe via the research project TRACE-Zürich. The authors also would like to thank the anonymous reviewers and editor for their helpful comments on this manuscript.
Conflicts of Interests
The authors declare that they have no conflicts of interest in this work, and no commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Work was done while Ge-Peng Ji was a research intern at Alibaba Group.
Ge-Peng Ji received the M. Sc. degree in communication and information systems from Wuhan University, China in 2021. He is currently a Ph. D. degree candidate at the Australian National University, supervised by Professor Nick Barnes, majoring in engineering and computer science. He has published about 10 peer-reviewed journal and conference papers. In 2021, he received the Student Travel Award from the Medical Image Computing and Computer-assisted Intervention Society.
His research interests lie in computer vision, especially in a variety of dense prediction tasks, such as video analysis, medical image segmentation, camouflaged object segmentation, and saliency detection.
Mingchen Zhuge received the M. Sc. degree in computer science from China University of Geosciences, China in 2021. He is a Ph. D. degree candidate at King Abdullah University of Science and Technology (KAUST) under the supervision of Prof. Juergen Schmidhuber. In 2019, he won the championship of the ZTE algorithm competition. He has worked as an intern at Alibaba Group and IIAI, and as a visiting scholar at SUSTech. He has also been invited to serve as a reviewer for top conferences such as CVPR, ICML, ECCV and NeurIPS.
His research interests include multi-modal learning and reinforcement learning.
Dehong Gao received the Ph. D. degree from The Hong Kong Polytechnic University, China in 2014. He is now an associate professor at Northwestern Polytechnical University, China.
His research interests include information retrieval, recommendation, natural language processing and machine learning.
Deng-Ping Fan received the Ph. D. degree from Nankai University, China in 2019. He joined the Inception Institute of Artificial Intelligence (IIAI), UAE in 2019. He has published about 50 top journal and conference papers in venues such as TPAMI, IJCV, TIP, TNNLS, TMI, CVPR, ICCV, ECCV and IJCAI. He won the Best Paper Finalist Award at IEEE CVPR 2019 and the Best Paper Award Nominee at IEEE CVPR 2020. He was recognized as a CVPR 2019 outstanding reviewer with a special mention award, a CVPR 2020 outstanding reviewer, an ECCV 2020 high-quality reviewer, and a CVPR 2021 outstanding reviewer. He served as a program committee board (PCB) member of IJCAI 2022–2024, a senior program committee (SPC) member of IJCAI 2021, a program committee (PC) member of CAD&CG 2021, a committee member of the China Society of Image and Graphics (CSIG), an area chair of the NeurIPS 2021 Datasets and Benchmarks Track, and an area chair of the MICCAI 2020 Workshop.
His research interests include computer vision, deep learning, and visual attention, especially the human vision on co-salient object detection, RGB salient object detection, RGB-D salient object detection, and video salient object detection.
Christos Sakaridis received the M. Sc. degree in computer science from ETH Zürich, Switzerland in 2016 and his Diploma in electrical and computer engineering from the National Technical University of Athens, Greece in 2014, conducting his Diploma thesis at the CVSP Group under the supervision of Prof. Petros Maragos. He received the Ph. D. degree in electrical engineering and information technology from ETH Zürich, Switzerland in 2021, working at Computer Vision Lab under the supervision of Prof. Luc Van Gool. He is a postdoctoral researcher at Computer Vision Lab, ETH Zürich, Switzerland. Since 2021, he has been the Principal Engineer of TRACE-Zürich, a project on computer vision for autonomous cars running at Computer Vision Lab and funded by Toyota Motor Europe. He is also the team leader of the EFCL project Sensor Fusion, which develops adaptive sensor fusion architectures for high-level visual perception.
His broad research fields are computer vision and machine learning. The focus of his research is on high-level visual perception, involving adverse visual conditions, domain adaptation, semantic segmentation, depth estimation, object detection, synthetic data generation, and fusion of multiple sensors (including lidar, radar and event cameras, with emphasis on their application to autonomous cars and robots).
Luc Van Gool received the Ph. D. degree in electromechanical engineering from Katholieke Universiteit Leuven, Belgium in 1981. Currently, he is a professor at Katholieke Universiteit Leuven, Belgium, and ETH Zürich, Switzerland. He leads computer vision research at both places and also teaches at both. He has been a program committee member of several major computer vision conferences. He received several Best Paper awards, won a David Marr Prize and a Koenderink Award, and was nominated Distinguished Researcher by the IEEE Computer Science committee. He is a co-founder of 10 spin-off companies.
His interests include 3D reconstruction and modeling, object recognition, tracking, gesture analysis, and a combination of those.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ji, GP., Zhuge, M., Gao, D. et al. Masked Vision-language Transformer in Fashion. Mach. Intell. Res. 20, 421–434 (2023). https://doi.org/10.1007/s11633-022-1394-4
Keywords: masked image reconstruction