Masked Vision-language Transformer in Fashion

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: +17%) and recognition (accuracy: +3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.


Introduction
The emergence of the transformer has drawn enormous attention from the academic community, facilitating the advancement of computer vision (CV) [3,4] and natural language processing (NLP) [5,6]. Benefiting from the robustness of transformers, researchers have also contributed to the vision-language (VL) field [7][8][9][10][11] with zeal. To better utilize the pre-trained models from CV and NLP, existing general VL models are mainly based on the BERT model [12], adopt well-pretrained vision extractors [13,14], or both.

Fig. 1 Different visual reconstruction tasks for VL pre-training: [1,2] utilize masked image modeling (top) with a random masking strategy (i.e., using [M] padding to replace the raw vectors), which reconstructs pre-extracted visual semantics (i.e., probabilities) at the feature level. We introduce a generative task named masked image reconstruction (bottom), which directly reconstructs image patches at the pixel level.

However, general VL methods [15][16][17] still struggle when applied to the fashion domain in e-commerce because they suffer from two main issues:
a) Insufficient Granularity. Unlike general objects with complex backgrounds, focusing only on coarse-grained semantics is insufficient for a fashion product [18][19][20], as it leads the network to generate sub-optimal results. In contrast, a fashion-oriented framework requires more fine-grained representations, such as a suit with different materials (e.g., wool, linen, and cotton) or collars (e.g., band, camp, and Windsor). b) Poor Transferability. The pre-extracted visual features are not discriminative for fashion-oriented tasks, restricting the cross-modal representations.
To address the above issues, we present a novel VL framework, termed masked vision-language transformer (MVLT). Specifically, we introduce a generative task, masked image reconstruction (MIR), for the fashion-based VL framework. Compared to previous pre-training tasks, such as masked image modeling (a regression task) or masked image classification (a classification task), MIR enables the network to learn more fine-grained representations via pixel-level visual knowledge (see Fig. 1). Further, inspired by the pyramid vision transformer (PVT) [21], we adopt a pyramid architecture for our VL transformer and then introduce the MIR task. These two improvements significantly enhance the ability to adapt to fashion-specific understanding and generative tasks, and allow training to be conducted in an end-to-end manner. As a result, MVLT can directly process raw multi-modal inputs in dense formats (i.e., linguistic tokens and visual patches) without extra pre-processing models (e.g., ResNet) [22,23]. Our main contributions are summarized as follows:
• We introduce a novel masked image reconstruction (MIR) task, which is the first real pixel-level generative strategy utilized in VL pre-training.
• Based on the MIR task, we present an end-to-end VL framework, called MVLT, for the fashion domain, greatly promoting transferability to downstream tasks and large-scale web applications.
• Extensive experiments show that MVLT significantly outperforms the state-of-the-art models on matching and generative tasks.

Background
In recent years, BERT-based pre-training models have been widely investigated for VL tasks. Many previous attempts, such as LXMERT [24], VL-BERT [25], and FashionBERT [1], were successful in a wide range of downstream applications.
Experiments and discussions show that BERT is a powerful method for learning multi-modal representations, outperforming several previous CNN-based [26] or LSTM-based [27,28] approaches.
Compared to previous studies, this paper aims to develop a more efficient self-supervised objective that can be easily implemented in pre-training and provides better representations for real-world applications. Thus, we review the research on masked learning strategies and end-to-end multi-modal schemes that inspired us the most.

Masked Learning Strategies
Masked modeling is the vital self-supervised task in BERT [12] and initially demonstrated outstanding abilities in natural language processing. Because of its strength in language models, researchers have replicated its utility in multi-modal and vision tasks. Most VL works [16,25,29] transfer masked modeling to visual tokens and use a regression task to reconstruct the token feature from masked (nonsense) replacements, or a classification task to predict the token's attribute. To reduce the difficulty of learning, Kaleido-BERT [2] optimizes masked modeling by employing a Kaleido strategy that facilitates coherent learning of multi-grained semantics. Although this work indeed improves the performance of VL-related tasks in fashion, we argue that its token-patch pre-alignment scheme, which relies on auxiliary tools [30,31], is still complex and impedes application to practical settings.
Another work [32] introduces the MLIM approach, which strengthens masked image modeling with an image reconstruction task and shares a similar idea to ours. However, our experiments showed that requiring a model to reconstruct the entire image without any reminder is too difficult. Recently, BEiT [33] and MAE [34] utilized BERT-style pre-training as part of the visual learner and discovered that models are effective at learning semantics with such a scheme. These two works strengthen our conviction that converting the original masked image modeling (i.e., a regression task) into a masked image reconstruction task is feasible.

However, our primary goal is to design a generative pretext task that makes multi-modal modeling in VL pre-training easier while eliminating the need for prior knowledge. This will be extremely helpful in our practical application setting with billion-level data.

Fig. 2 Comparison of MVLT with cutting-edge fashion-oriented VL frameworks. FashionBERT (a) utilizes a language-based encoder (i.e., BERT) to extract VL representations with single-scale visual input (i.e., image patches). Kaleido-BERT (b) extends it with two upgrades: it adds five fixed-scale inputs (i.e., Kaleido patches) to acquire hierarchical visual features and designs Kaleido vision tasks to fully learn VL representations. However, the visual embeddings of these models are frozen (i.e., without parameter updating); thus, a lack of domain-specific visual knowledge severely hinders their transferability. Differently, our MVLT (c) adaptively learns hierarchical features by introducing masked vision tasks in an end-to-end framework, significantly boosting VL-related understanding and generation.

End-To-End Multi-Modal Schemes
Pixel-BERT [35] is the first method to consider end-to-end pre-training. It employs 2×2 max-pooling layers to reduce the spatial dimension of image features, with each image being downsampled 64 times. Although this work sets a precedent for end-to-end training, such a coarse and rigid method cannot work well in practical settings because it simply incorporates a ResNet [13] into joint pre-training without considering the loss in speed and performance. Recently, VX2TEXT [36] proposed converting all modalities into the language space and then performing end-to-end pre-training using a relaxation scheme. Though translating all modalities into a unified latent space is appealing, a model that takes features extracted by pre-trained methods as input cannot be regarded as an end-to-end framework. Chronologically, ViLT [37] is the first method that truly investigates an end-to-end framework by replacing region- or grid-based features with patch-based projections. However, without further designs, it cannot obtain competitive performance since it is just a vanilla extension of ViT [3]. Grid-VLP [38] is similar to ViLT but takes a further step by demonstrating that using a pre-trained CNN as the visual backbone can improve performance on downstream tasks. SOHO [39] takes the entire image as input and creates a visual dictionary to represent local regions. However, this method does not fit fashion-specific applications due to the lack of reliable alignment information; as a result, the visual dictionary may merely learn the location of the background or foreground rather than complex semantics. FashionVLP [40] uses a feedback strategy to achieve better retrieval performance. In practice, it uses well-pretrained knowledge extracted from ResNet and then models whole, cropped, and landmark representations; it also adopts Faster R-CNN as an object detector to generate RoI candidates. Some other works are designed for end-to-end pre-training [41][42][43], but they target specific tasks and are not directly applicable to our research. Despite existing methods employing different approaches to construct an end-to-end scheme, solutions that forgo pre-trained models (e.g., ResNet, BERT) and use raw data (i.e., text, images) as inputs remain under-explored and are urgently needed in multi-modal applications. This motivates our MVLT, which extends the pyramid vision transformer [21] into an architecture that adaptively extracts hierarchical representations for fashion cross-modal tasks. It is the first model that solves the end-to-end problem of VL pre-training in fashion, which allows us to simplify the implementation of MVLT in the fashion industry using a twin-tower architecture [44].

Masked Vision-Language Transformer
Our goal is to build an end-to-end VL framework for the fashion domain. The overall pipeline of our MVLT is depicted in Fig. 3. Like PVT, our architecture consists of four stages and generates features of different sizes. The two key components of the proposed architecture are the multi-modal encoder (Sec. 3.1) and the pre-training objectives (Sec. 3.2).

Multi-Modal Encoder
As shown in Fig. 3, MVLT admits visual and verbal inputs. On the language side, we first tokenize the caption of a fashion product and use the special token [MASK] to randomly mask out caption tokens with a masking ratio $r_l$. Following the masking procedure, we obtain a sequence of word tokens. Then, we insert a special [CLS] token at the head of this sequence. Besides, we pad the sequence to a unified length $L$ using the [PAD] token if it is shorter than 128. This procedure generates the language input ids $\mathbf{T} = [t_1; \cdots; t_L] \in \mathbb{R}^{L}$. On the vision side, we treat $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ as the visual input, where $H$ and $W$ denote the height and width of the given input. This input is sliced into multiple grid-like patches, where $N = \frac{HW}{P^2}$ is the total number of patches and $P$ denotes the patch size. Similarly, the split patches are masked out with a masking ratio $r_v$. We provide more details about the above masking strategy for the language and vision parts in Sec. 3.2.
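The input preparation described above can be sketched as follows. This is a minimal PyTorch illustration; the tokenizer output, the special-token ids (`mask_id`, `cls_id`, `pad_id`), and the exact masking order are assumptions rather than the released implementation:

```python
import torch

def prepare_language_input(token_ids, mask_id, cls_id, pad_id, L=128, r_l=0.15):
    """Randomly mask caption tokens with ratio r_l, prepend [CLS], and pad to length L with [PAD]."""
    mask = torch.rand(len(token_ids)) < r_l                        # Bernoulli masking decisions
    tokens = [mask_id if m else t for t, m in zip(token_ids, mask.tolist())]
    tokens = [cls_id] + tokens                                     # insert [CLS] at the head
    tokens = tokens[:L] + [pad_id] * max(0, L - len(tokens))       # pad (or truncate) to length L
    return torch.tensor(tokens)                                    # language input ids T of shape (L,)

def prepare_visual_input(image, P=4):
    """Slice an (H, W, 3) image tensor into N = H*W / P^2 non-overlapping P x P patches."""
    H, W, _ = image.shape
    patches = image.reshape(H // P, P, W // P, P, 3).permute(0, 2, 1, 3, 4)
    return patches.reshape(-1, P, P, 3)                            # patch sequence V of shape (N, P, P, 3)
```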
The above multi-modal inputs are embedded and fed into four consecutive VL interaction stages (i.e., $k \in \{1, 2, 3, 4\}$). In the first stage, we generate the language and vision embeddings, $\mathbf{T}^1$ and $\mathbf{V}^1$, respectively, from the given inputs ($\mathbf{T}$ and $\mathbf{V}$). For the subsequent stages, we consider only the $k$-th stage to keep the illustration concise. As shown in the bottom part of Fig. 3, we first embed the language embedding $\mathbf{T}^k \in \mathbb{R}^{L \times D_k}$ into the language hidden feature $\mathbf{m}^k \in \mathbb{R}^{L \times D_{k+1}}$, which is formulated as:

$$\mathbf{m}^k = \mathbf{T}^k \mathbf{W}^k_t + \mathbf{P}^k_t,$$

where $\mathbf{W}^k_t \in \mathbb{R}^{D_k \times D_{k+1}}$ and $\mathbf{P}^k_t \in \mathbb{R}^{L \times D_{k+1}}$ are the learnable linear embedding and position embedding matrices, and $D_k$ denotes the size of the hidden feature embedding.
The visual embeddings are $\mathbf{V}^k \in \mathbb{R}^{\frac{H}{R^k} \times \frac{W}{R^k} \times D_k}$, where $R^k$ denotes the spatial reduction factor of the visual embedding. To acquire pyramid visual features, $\mathbf{V}^k$ is then embedded and flattened into the visual hidden feature via a two-dimensional projection (i.e., a Conv2D block). In particular, this projection reduces the spatial dimension with kernel size $K^k$ and stride $S^k$, which can be formulated as:

$$\mathbf{n}^k = \text{Flatten}\big(\text{Conv2D}(\mathbf{V}^k)\big) + \mathbf{P}^k_v,$$

where $\mathbf{P}^k_v \in \mathbb{R}^{N \times D_{k+1}}$ denotes the position embedding matrix. We then concatenate the two VL hidden features, $\mathbf{z}^k = [\mathbf{m}^k; \mathbf{n}^k]$, and feed them into multiple ($M_k$) VL transformer encoders. Each encoder contains a multi-head self-attention layer with spatial reduction (i.e., a reduce block), a multi-layer perceptron, and layer normalization. Finally, we obtain the encoded multi-modal feature $\mathbf{z}^{k+1} = [\mathbf{m}^{k+1}; \mathbf{n}^{k+1}]$ and divide it into a language part $\mathbf{T}^{k+1} = \mathbf{m}^{k+1}$ and a visual part $\mathbf{V}^{k+1} = \text{Reshape}(\mathbf{n}^{k+1})$, where the $\text{Reshape}(\cdot)$ operation recovers the spatial dimensions of the given feature.
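A single VL interaction stage can be paraphrased as the following sketch. Module names, default hyperparameters, and the use of a vanilla transformer encoder in place of the spatial-reduction attention are assumptions; it only mirrors the embed-concatenate-encode-split flow described above:

```python
import torch
import torch.nn as nn

class VLStage(nn.Module):
    """Sketch of one VL interaction stage: embed text and patches, run M_k encoders, then split."""
    def __init__(self, d_in, d_out, L=128, n_patches=4096, kernel=2, stride=2, depth=2, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(d_in, d_out)                        # W_t^k
        self.text_pos = nn.Parameter(torch.zeros(1, L, d_out))         # P_t^k
        self.patch_embed = nn.Conv2d(d_in, d_out, kernel, stride)      # Conv2D block with K_k, S_k
        self.vis_pos = nn.Parameter(torch.zeros(1, n_patches, d_out))  # P_v^k
        layer = nn.TransformerEncoderLayer(d_out, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)             # stand-in for the M_k PVT encoders

    def forward(self, T_k, V_k):
        # T_k: (B, L, d_in) language embedding; V_k: (B, d_in, h, w) visual embedding
        m_k = self.text_proj(T_k) + self.text_pos                      # language hidden feature m^k
        n_k = self.patch_embed(V_k)                                    # (B, d_out, h', w')
        B, C, h, w = n_k.shape
        n_k = n_k.flatten(2).transpose(1, 2) + self.vis_pos[:, :h * w] # visual hidden feature n^k
        z_k = torch.cat([m_k, n_k], dim=1)                             # z^k = [m^k; n^k]
        z_next = self.encoder(z_k)                                     # encoded multi-modal feature z^{k+1}
        T_next, n_next = z_next[:, :m_k.size(1)], z_next[:, m_k.size(1):]
        V_next = n_next.transpose(1, 2).reshape(B, C, h, w)            # Reshape(.) back to spatial layout
        return T_next, V_next
```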
After four VL interaction stages, we obtain four text embeddings $\{\mathbf{T}^k\}_{k=1}^{4}$ and four pyramid vision embeddings $\{\mathbf{V}^k\}_{k=1}^{4}$, respectively. Table 1 presents more detailed hyperparameter settings of our method.

Fig. 4 PVT-based architectures offer more options for designing the masking strategy. The vanilla ViT-based method (a) [37] only selects a fixed-scale patch to mask, i.e., $P^2$. The PVT-based method (b) is more versatile because it combines more fine-grained patches into a basic masking unit, i.e., $(\alpha \times P)^2$, where $\alpha \in \{1, 2, \ldots, 8\}$. These masked patches do not overlap with each other. This characteristic provides a flexible way to learn suitable semantics by using different values of $\alpha$. Notably, we adopt a fixed scale factor for the masking units in an individual experiment.

Pre-Training Objectives
To acquire discriminative multi-modal representations, we adopt three pre-training tasks to establish the inter- and intra-relationships among the most primitive VL modalities: vision (masked image reconstruction, MIR), language (masked language modeling, MLM), and vision-language (image-text matching, ITM).
Objective 1: Masked Image Reconstruction (MIR). In the general domain, models can learn coarse-grained semantics from patch- or region-based objectives and achieve satisfactory results. However, fashion-specific models require more fine-grained representations, such as a suit with different materials (e.g., wool) or collars (e.g., Windsor), which calls for a pixel-to-pixel vision pre-training objective. Inspired by masked language modeling [12], we attempt to build pixel-to-pixel relationships from the perspective of generative tasks, which promotes the scalability of visual representations. We design masked image reconstruction (MIR) to accomplish this idea. To help our model learn better with MIR, we utilize the pyramid characteristic of the PVT architecture [21] to design a flexible masking strategy. Unlike the ViT-based method (a) in Fig. 4, the PVT-based architecture (b) masks out the input image according to a masking-unit matrix composed of small-grained patches. Given the patch sequence $\mathbf{V} = \{v_n\}_{n=1}^{N} \in \mathbb{R}^{N \times P \times P \times 3}$, the masked-out sequence $\mathbf{V}^{\backslash\Phi}$ is defined as:

$$\mathbf{V}^{\backslash\Phi} = F_M\big(\mathbf{V}; \{M(q; \alpha; \Phi)\}_{q=1}^{Q}, [\text{ZERO}]\big),$$

where $F_M(\cdot;\cdot)$ represents the function (or procedure) of our masking strategy, $q$ indexes a randomly selected masking unit, and [ZERO] means that we fill the selected areas with a pixel value of zero. The masking units $\{M(q; \alpha; \Phi)\}_{q=1}^{Q}$ are derived from an indicator function over the randomly sampled index set $\Phi$, where $Q = N/\alpha^2$ is the total number of masking units. For instance, in Fig. 4 (b), $\alpha$ can be set from 1 to 8. In our default setting, we use $\alpha = 4$ to capture more fine-grained semantics.
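The masking-unit strategy of Fig. 4 can be sketched as below, assuming the image is masked directly in pixel space; sampling the units with a uniform ratio r_v is an assumption about the procedure, not the released code:

```python
import torch

def mask_image(image, P=4, alpha=4, r_v=0.5):
    """Zero out randomly chosen (alpha*P)^2 masking units, i.e., blocks of alpha x alpha patches."""
    H, W, _ = image.shape
    unit = alpha * P                                        # side length of one masking unit in pixels
    gh, gw = H // unit, W // unit                           # Q = gh * gw masking units in total
    keep = torch.rand(gh, gw) >= r_v                        # indicator: False marks units in Phi (masked)
    pixel_keep = keep.repeat_interleave(unit, 0).repeat_interleave(unit, 1)
    masked = image.clone()
    masked[~pixel_keep] = 0.0                               # [ZERO]: fill the selected areas with zeros
    return masked, pixel_keep
```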
Since the smooth-L1 loss is less sensitive to outliers, we use it as the pre-training objective to reconstruct the whole image from the masked-out sequence $\mathbf{V}^{\backslash\Phi}$. It is defined as:

$$\mathcal{L}_{\text{MIR}} = \sum_{(x,y)} \text{SmoothL1}\big(\hat{\mathbf{I}}_{(x,y)} - \mathbf{I}_{(x,y)}\big),$$

where $\hat{\mathbf{I}}_{(x,y)}$ and $\mathbf{I}_{(x,y)}$ denote the pixels at coordinate $(x, y)$ in the reconstructed image $\hat{\mathbf{I}}$ and the input image $\mathbf{I}$, respectively. $\hat{\mathbf{I}} = F_{\text{MIR}}(\mathbf{V}^{\backslash\Phi}; \mathbf{W}_{\text{MIR}})$ is parameterized by learnable weights $\mathbf{W}_{\text{MIR}}$. The function $F_{\text{MIR}}(\cdot; \mathbf{W}_{\text{MIR}})$ denotes a standard four-level U-Net [45] decoder, which admits the four pyramid vision embeddings $\{\mathbf{V}^k\}_{k=1}^{4}$ as inputs.
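In code, the MIR objective reduces to a smooth-L1 penalty between the reconstruction and the original image; a minimal sketch, where `decoder`, `V1`–`V4`, and `image` are hypothetical names:

```python
import torch.nn.functional as F

def mir_loss(reconstructed, original):
    """Smooth-L1 reconstruction loss between the decoded image and the unmasked input (MIR)."""
    return F.smooth_l1_loss(reconstructed, original)

# Assumed usage: `decoder` stands for the four-level U-Net-style decoder F_MIR,
# fed with the four pyramid vision embeddings produced by the encoder.
# recon = decoder(V1, V2, V3, V4)
# loss_mir = mir_loss(recon, image)
```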
Objective 2: Image-Text Matching (ITM). The classification embedding appended to the last language embedding $\mathbf{T}^4$ is used to couple the representations from the VL modalities. We utilize the function $F_{\text{ITM}}(\cdot; \mathbf{W}_{\text{ITM}})$ to denote a fully-connected (FC) layer followed by a softmax layer, parameterized by the weights $\mathbf{W}_{\text{ITM}}$. $F_{\text{ITM}}$ outputs a two-class probability vector $\mathbf{p}_{\text{ITM}} = F_{\text{ITM}}([\mathbf{T}, \mathbf{V}]; \mathbf{W}_{\text{ITM}})$, representing whether the input fashion image and caption match (i.e., a positive pair) or not (i.e., a negative pair). The positive pairs are selected from the same fashion product category, whereas the negative pairs are chosen at random from different entries. The binary cross-entropy loss finally constrains this task:

$$\mathcal{L}_{\text{ITM}} = -\big[y_{\text{ITM}} \log p_{\text{ITM}} + (1 - y_{\text{ITM}}) \log(1 - p_{\text{ITM}})\big],$$

where $p_{\text{ITM}}$ is the predicted probability of the matched class and $y_{\text{ITM}}$ denotes the ground-truth label, i.e., 1 for matched pairs and 0 for unmatched pairs.

Objective 3: Masked Language Modeling (MLM). Following [46], we randomly use the special token [MASK] to replace the original text tokens. The target of MLM is to predict the content of the masked tokens using the unmasked tokens and patches. Given a tokenized sequence $\mathbf{T} = \{t_1, \ldots, t_L\}$, the masked-out sequence is denoted by $\mathbf{T}^{\backslash i} = \{t_1, \ldots, [\text{MASK}]_i, \ldots, t_L\}$. We use the cross-entropy loss to model this objective:

$$\mathcal{L}_{\text{MLM}} = -\log p_{\text{MLM}},$$

where $p_{\text{MLM}} = F_{\text{MLM}}(\mathbf{T}^{\backslash i}; \mathbf{W}_{\text{MLM}})$ denotes the predicted probability of each masked-out token $[\text{MASK}]_i$ given $\mathbf{T}^{\backslash i}$, and the function $F_{\text{MLM}}(\cdot; \mathbf{W}_{\text{MLM}})$ represents a classifier with parameters $\mathbf{W}_{\text{MLM}}$.

The final pre-training objective of the proposed MVLT is a combination of the three objectives:

$$\mathcal{L} = w_1 \mathcal{L}_{\text{MIR}} + w_2 \mathcal{L}_{\text{ITM}} + w_3 \mathcal{L}_{\text{MLM}},$$

where $w_1$, $w_2$, and $w_3$ are weighting factors that balance the three losses.
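The ITM and MLM heads and the combined objective can be written compactly as the following sketch; the head structure, the -100 ignore-index convention for unmasked tokens, and the variable names are assumptions, while the default weights follow the pre-training settings reported later:

```python
import torch.nn as nn
import torch.nn.functional as F

class PretrainHeads(nn.Module):
    """Sketch of the ITM and MLM heads on top of the last-stage language features."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.itm_head = nn.Linear(d_model, 2)             # F_ITM: matched vs. unmatched
        self.mlm_head = nn.Linear(d_model, vocab_size)    # F_MLM: per-token vocabulary classifier

    def forward(self, cls_feat, token_feats):
        # cls_feat: (B, d_model) classification embedding; token_feats: (B, L, d_model)
        return self.itm_head(cls_feat), self.mlm_head(token_feats)

def total_loss(itm_logits, itm_labels, mlm_logits, mlm_targets, loss_mir,
               w1=10.0, w2=1.0, w3=1.0):
    """Weighted combination of the MIR, ITM, and MLM objectives."""
    loss_itm = F.cross_entropy(itm_logits, itm_labels)                     # two-class matching loss
    loss_mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_targets,
                               ignore_index=-100)                          # only masked tokens scored
    return w1 * loss_mir + w2 * loss_itm + w3 * loss_mlm
```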

Downstream Tasks
For a fair comparison, we follow the same training/inference protocols as in [1,2] and adopt the Fashion-Gen 2018 [47] benchmark as the basis of our experiments. This dataset contains 67,666 fashion products, each accompanied by a text description.

Table 2 Retrieval (i.e., TIR and ITR) and recognition (i.e., M-CR and S-CR) performances on the Fashion-Gen dataset. ↑ means the larger, the better. Here, SumR = (R@1 + R@5 + R@10) × 100 and SumC = (A + macro-F) × 100. "N/A" means the score is not available. "Diff" means the numerical difference between the performance of the second-ranked competitor and our MVLT.

Task 1: Text-Image Retrieval (TIR). Given a text query, TIR aims to retrieve the matching product image from a set of candidates. In particular, we take a product title and its corresponding image as a positive image-text pair, while the negative pairs are randomly selected from a pool of mismatched images. To increase the difficulty of the experiment, we constrain each set of image-text candidates (i.e., a positive pair and 100 negative pairs) to the same sub-category, making them as similar as possible.
Task 2: Image-Text Retrieval (ITR). As the reverse process of the TIR task, the ITR task aims to retrieve the matching text description given a query image; together, these bidirectional retrieval tasks (i.e., TIR and ITR) form a prominent branch of cross-modal research. Similar to the selection strategy in TIR, we prepare a set of candidate image-text pairs, including a positive pair and 100 negative pairs from the same sub-category. We evaluate the zero-shot ability of our MVLT on these two retrieval tasks without further fine-tuning, using three accuracy metrics (i.e., R@1, R@5, and R@10) obtained by ranking the predicted matching probabilities.

Task 3: Category Recognition (M-CR and S-CR). This task has two parts: main-category recognition (M-CR) and sub-category recognition (S-CR). These tasks play a fundamental role in practical e-commerce applications that provide the specific category of a queried product. We expect the model to recognize differences at two granularity levels: 48 main categories and 122 sub-categories, e.g., {M-CR = SWEATERS, S-CR = CREWNECKS}. On top of the classification embedding in the last language embedding $\mathbf{T}^4$, we add two independent FC layers to generate the final probabilities for the two recognition tasks. This procedure requires additional fine-tuning with recognition labels. We utilize two recognition-related metrics to evaluate performance: accuracy (A) and macro F-measure (macro-F).

Task 4: Masked Image Generation (MIG). The MIG task can be viewed as a pixel-wise reconstruction task. Each patch in the image is randomly masked with probability $r_v$ (refer to the pre-training task MIR in Sec. 3.2). Then, we ask the model to recreate the whole image using the uncovered areas as visual clues.
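For the retrieval tasks, each query is scored against its 101 candidates (one positive, 100 negatives) and R@K is read off the rank of the positive pair; a sketch under the assumption that the ITM matching probability is used as the score:

```python
import torch

def recall_at_k(scores, positive_index=0, ks=(1, 5, 10)):
    """Compute R@K given matching scores of shape (num_queries, 101): one positive, 100 negatives."""
    order = scores.argsort(dim=1, descending=True)                 # candidates ranked by score
    pos_rank = (order == positive_index).float().argmax(dim=1)     # 0-based rank of the positive pair
    return {f"R@{k}": (pos_rank < k).float().mean().item() for k in ks}

# Example: ten queries with random scores, positive candidate stored first in each row.
# print(recall_at_k(torch.rand(10, 101)))
```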

Experiments
This section details our experiments and analyzes the factors that lead to the success of the proposed MVLT.

Settings
This part provides the hyperparameter settings for our training procedure. i) Pre-training. We use PyTorch to implement our method, accelerated by 8 Tesla V100 GPUs. We adopt the AdamW optimizer with a momentum value of 0.9, a mini-batch size of 1200 (i.e., 150 per GPU), and a weight decay of $10^{-4}$. To avoid over-fitting, we initialize MVLT with ImageNet pre-trained weights [21]. The learning rate is initially set to $2.5 \times 10^{-3}$ and follows a cosine learning schedule. On the visual side, the input image is resized to $H = W = 256$ and split into multiple sub-patches of size $P = 4$. On the language side, all product captions are tokenized and padded to a unified length of $L = 128$, including classification, caption, and padding tokens. The masking probabilities for vision and language are set to $r_v = 0.5$ and $r_l = 0.15$, respectively. We empirically set the weighting factors $\{w_1 = 10, w_2 = 1, w_3 = 1\}$ to balance the orders of magnitude of the different loss values. ii) Fine-tuning. We transfer the pre-trained VL representations to each downstream application via end-to-end fine-tuning, with settings consistent with the pre-training process.
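The optimization settings above translate roughly into the following loop; `model`, `dataloader`, and `num_steps` are hypothetical placeholders, and the scheduler details beyond the cosine decay are assumptions:

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def pretrain(model, dataloader, num_steps):
    """Optimization loop matching the reported settings; `model` returns the combined loss."""
    optimizer = AdamW(model.parameters(), lr=2.5e-3,
                      betas=(0.9, 0.999), weight_decay=1e-4)   # momentum 0.9, weight decay 1e-4
    scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)  # cosine learning-rate schedule
    for images, captions in dataloader:                        # global batch size 1200 (150 per GPU)
        loss = model(images, captions)                         # combined MIR + ITM + MLM loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```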

Results
As described in Sec. 3.3, we provide the details of four downstream fashion-related tasks. Experimental results show that our MVLT outperforms all competitors, including VSE [48], VSE++ [49], SCAN [26], PFAN [50], ViLBERT [16], ImageBERT [15], FashionBERT [1], VL-BERT [25], OSCAR [29], and Kaleido-BERT [2], which demonstrates its superiority in handling VL understanding and generation tasks.

TIR and ITR. As shown in Table 2, our MVLT surpasses the best method (i.e., Kaleido-BERT, CVPR'21) on the TIR task by margins of +17.40% and +20.91% on R@5 and R@10, respectively. As for ITR, our method delivers even more competitive results, with improvements of +17.11% and +22.73% on the R@5 and R@10 metrics, respectively. These results strongly support that our model is powerful enough to match vision and language, and they show how a) MIR and b) end-to-end pre-training are useful in fashion. We believe that MVLT could set a precedent for many industrial applications because it is a simple, cost-effective, and powerful architecture. Besides, we present the visualization results of these two retrieval tasks in Fig. 5.

M-CR and S-CR. Compared with BERT-based architectures [1,2,15,29], we also achieve top-1 performance in these two tasks.

When comparing D3 (i.e., the final setting) to D1 and D2 in the TIR task, we can see that D3 performs better on the R@5 metric: 74.10% (D1) < 76.00% (D2) < 76.20% (D3). We conclude that the MLM task helps the model thoroughly learn language knowledge, so it provides a more precise query to recall better-matching images.
In the ITR task, we find a similar conclusion when comparing D2 to D1 and D3 on the R@5 metric: 70.80% (D1) < 75.50% (D2) < 76.30% (D3). It indicates that better visual learning leads to an accurate image query that matches the most appropriate caption.
Table 4 Ablation study for the contribution of loading PVT's weights pre-trained on ImageNet [51].

More Discussions
How does MVLT perform in general domains? To further investigate its potential in general settings, we discuss two extended questions here. a) Can general models be directly transferred to the fashion domain?
Inspired by the huge impact of general vision-language models, we further investigate the zero-shot performance of two typical general models (i.e., ViLBERT [16] and CLIP [52]), as shown in Table 5. This once again demonstrates the necessity and superiority of MVLT pre-trained on a specific domain.

Table 5 Comparison of zero-shot retrieval results (TIR and ITR; R@1, R@5, and R@10) on the Fashion-Gen dataset.

b) Can MVLT also work well in the general domain? We further verify the potential of our MVLT in the general domain. Table 6 reports the performance on the MS-COCO 2014 dataset [53], where MVLT follows the same training standards as in [37]. It shows that MVLT achieves promising results compared to the latest models (i.e., Unicoder-VL [54], UNITER [17], and ViLT [37]) without extra training data or special retrieval losses during training. This indicates that MVLT is also a promising solution when extended to general scenes.

Why do the pyramid architecture and MIR benefit? As mentioned in the introduction, there are two understudied problems in the fashion domain. For the transferability problem, the pyramidal architecture [21] takes raw data as input without complex pre-processing, which essentially alleviates the deployment burden in industry; moreover, MIR does not need human annotations such as classification tags, bounding boxes, or pixel-wise segmentation labels. For the granularity problem [55], the pyramidal architecture [21] provides multi-scale features with rich semantics. Combined with the MIR task, our framework can represent multi-grained fashion knowledge (e.g., dress, V-neck). These features are helpful and urgently required in this field.
A VL model that performs well on semantic understanding tasks (e.g., retrieval [56], classification) can serve as a good foundation and be easily applied to downstream tasks (e.g., text-to-image synthesis [57], image captioning) by adding a decoder. We did not conduct image captioning experiments because we focus on basic representation learning in fashion in this work.

MVLT vs. MAE [34]. MAE learns general representations by allowing the model to explore pixel-to-pixel associations, so MVLT and MAE are similar in this regard. However, our MVLT is the first to introduce vision-reconstruction-style pre-training into multi-modal research (e.g., the fashion domain).

Conclusion
We present a vision-language framework named MVLT, which provides two contributions to this field: 1) a newly designed masked image reconstruction (MIR) objective, and 2) an end-to-end pre-training scheme. The experimental and ablative analyses demonstrate its superiority on various matching and generative tasks. MVLT outperforms the cutting-edge method Kaleido-BERT by large margins on retrieval and recognition tasks, which could catalyze progress in the fashion domain. The proposed out-of-the-box method, working end-to-end, simplifies the workflow (e.g., data pre-processing and model training) for actual engineering value, improving development and business efficiency on large-scale e-commerce websites by approximately 50%.
In the future, we will continue to investigate extremely efficient methods in this field using well-known technologies such as hashing [58], network pruning, and knowledge distillation to alleviate the storage and computing limitations in real-world e-commerce applications.

Fig. 3 Pipeline of our MVLT framework. The overall architecture consists of four stages, each containing language and visual embeddings and multiple transformer encoders (×$M_k$). By introducing the masking strategy for three sub-tasks, i.e., masked image reconstruction (MIR), image-text matching (ITM), and masked language modeling (MLM), our MVLT can be trained in an end-to-end manner. More details can be found in Sec. 3.

Fig. 5 Visualization results on the TIR and ITR tasks in terms of the top-five ranked probabilities predicted by our MVLT. "Matched" indicates the ground-truth image-text pair.

Table 1 Hyperparameters of our multi-modal encoders.