High fidelity virtual try-on network via semantic adaptation and distributed componentization

Image-based virtual try-on systems have significant commercial value in online garment shopping. However, prior methods fail to appropriately handle details, so are defective in maintaining the original appearance of organizational items including arms, the neck, and in-shop garments. We propose a novel high fidelity virtual try-on network to generate realistic results. Specifically, a distributed pipeline is used for simultaneous generation of organizational items. First, the in-shop garment is warped using thin plate splines (TPS) to give a coarse shape reference, and then a corresponding target semantic map is generated, which can adaptively respond to the distribution of different items triggered by different garments. Second, organizational items are componentized separately using our novel semantic map-based image adjustment network (SMIAN) to avoid interference between body parts. Finally, all components are integrated to generate the overall result by SMIAN. A priori dual-modal information is incorporated in the tail layers of SMIAN to improve the convergence rate of the network. Experiments demonstrate that the proposed method can retain better details of condition information than current methods. Our method achieves convincing quantitative and qualitative results on existing benchmark datasets.


Introduction
With the rapid development of the Internet apparel industry, more and more people shop for garments online. Traditional offline garment shopping allows assessment of fit through physical try-on. However, online garment shopping only permits a visual assessment by browsing models, which cannot give firsthand experience. Therefore, more and more researchers are trying to find effective online solutions. Existing virtual try-on strategies are either 2D image-based [1][2][3][4][5] or 3D model reconstruction-based [6][7][8][9][10] methods.
3D model reconstruction-based methods use computer graphics to reconstruct a 3D human model, which can make the result more plausible by controlling model joints. However, 3D model reconstruction-based methods need intensive computation and require a high degree of precision in model construction. They are unaffordable for general users.
Therefore, 2D image-based methods are a better choice for universal online garment try-on. On the one hand, image synthesis techniques can reduce calculation costs, so are suitable for customers without a high-performance processing device. On the other hand, image processing techniques based on deep learning can produce very realistic fitting results. If the shape of an in-shop garment is the same as that of the garment worn by a person, the in-shop garment only needs to be deformed and joined in corresponding regions of the person's image. In other cases, however, arm lengths inevitably clash with sleeve lengths, causing problems such as textural confusion. Similarly, the collar type of the result can be affected by the collar of the garment worn by the person (e.g., a V-neck is changed to a crew-neck). We need to find an effective approach to solve these problems.
Other issues must also be considered in the try-on process: (i) body invariant characteristics (e.g., head, pants) need to be preserved, (ii) embroidery and textures of the in-shop garment need to be transformed accurately, and (iii) the resulting image must be seamless, clear, and free from visible defects and noise.
Early researchers conducted pioneering studies. Zheng et al. [11] proposed an image-based garment changing system, which utilizes body factor extraction and content-aware image distortion, and determines joint positions by a neural network. The shape of the model is warped to the body's shape, and head swapping is performed to produce realistic virtual results. Neuberger et al. [12] proposed an outfit try-on approach (O-VITON), which accurately synthesizes the outfit on a body by an online optimization scheme. It has the ideal effect of fitting to the outfit, and in particular, it expands the try-on application to other parts such as pants. To save time and fit all kinds of garments one by one, virtual try-on network (VITON) [2] proposed a coarse-to-fine framework to transfer the in-shop garment to the corresponding area of the human body. First, VITON composites the result by coarsely fusing the in-shop garment to the corresponding part of a person. Then, it refines unclear areas of the coarse garment by a refinement network. In contrast to VITON, characteristic-preserving image-based virtual try-on network (CP-VTON) [4] trained a special geometric matching network [13] to be used with thin plate spline (TPS) [14] to warp the in-shop garment so as to retain rich details. Then it finely integrates the warped garment with the body by a try-on network. Later, CP-VTON+ [15] improved the warping effect of CP-VTON to give better results. Adaptive content generating and preserving network (ACGPN) [16] proposed an effective architecture, which addresses the preservation of image details by generative adversarial networks (GAN) [17] and produces realistic fitting results with careful alignment of the garment. However, its results may be flawed due to failure to preserve collar type, sleeve shape, and arm details.
To overcome the challenges above, we propose a novel virtual try-on framework. Its key processes are as follows: (i) a geometric matching network (GMN) [13] coarsely warps the in-shop garment and uses the warped garment to generate a target semantic map, then (ii) SMIAN generates the refined warped garment and body components (arms and neck) using the target semantic map, and (iii) SMIAN integrates the generated body components to obtain the final virtual try-on result.
The main contributions of the paper are:
• a novel image-based virtual try-on framework which effectively synthesizes the in-shop garment and the reference image by componentizing the garment and person,
• a novel component generating network, SMIAN, which generates and integrates high-quality body components and avoids texture confusion by refined generation of body parts individually, and
• an anti-cover map and a neck semantic map, introduced to prevent the garment covering the arms and to avoid incorrect collar type and sleeve shape, which effectively increase the authenticity of generated images.
The remainder of the paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed virtual try-on framework in detail. Section 4 reports experimental results. Section 5 draws brief conclusions.

Conditional image synthesis
The contribution of GAN [17,19] to fashion image processing is enormous. Conditional GAN (CGAN) [20][21][22][23] makes generated images controllable according to given conditions. Cui et al. [24] proposed an end-to-end virtual garment display method to render sketches and garment fabric. Jetchev et al. [1] proposed conditional analogy GAN (CA-GAN), which defines the virtual try-on task as an image analogy problem and adds a cyclic consistency loss function. However, it can only roughly transform properties and cannot adapt to geometric deformation in generating image details. Lee et al. [3] introduced the adversarial mechanism to the warping and try-on stages. It adds GAN loss to optimize the fit of the garment, making the generated image more plausible. Our method enhances the visual effect of the generated results based on GAN.

Human parsing and understanding
Estimation of human pose [25][26][27] and human semantic segmentation [28][29][30] are widely used in human-centered image research. Long et al. [31] first proposed a CNN-based method called fully convolutional networks (FCN) for semantic segmentation. Gong et al. [28] proposed a new benchmark, Look into Person (LIP), providing a significant advance in terms of target diversity. Moreover, they studied self-supervised structure-sensitive learning for body parsing and body estimation. In the virtual try-on task, a human semantic map segments the human image to obtain the body parts needed for the experiment. The human pose can provide a warping guide for the shape of the in-shop garment and arms. Therefore, in our method, both are necessary data for generating virtual try-on results.

Virtual try-on
Current virtual try-on methods are divided into 2D image-based and 3D model reconstruction-based methods. Mir et al. [7] proposed a method to transfer textures of garment images to the 3D skinned multi-person linear (SMPL) model [32]; it is more accurate and faster than methods based on TPS. Zhao et al. [33] proposed a novel monocular-to-3D virtual try-on network, M3D-VTON, which generates a nonparametric 3D mesh model based on the generated 2D try-on result, creating a new virtual try-on mode. Yang et al. [16] proposed a content generation preservation network, ACGPN, which can adaptively determine which parts of a person's image should be preserved. It dramatically reduces artefacts and blurring in the generated results. However, it still does not entirely solve the excessive dependence of the generated result on the garment worn by a person, so the sleeve shape and collar type of the result do not match the in-shop garment. Cui et al. [34] proposed a flexible person generation framework, DiOr. It effectively performs the work of virtual try-on through a recurrent generation pipeline. Choi et al. [35] proposed a three-module framework which uses a high-resolution dataset and synthesizes the in-shop garment using an improved residual module. However, it is insufficient for good arm retention. Our method overcomes these challenges by body componentization.
In Table 1, we compare various state-of-the-art methods in terms of implementation and performance. Our method splits the virtual try-on problem into a multi-component problem, which offers advantages in virtual try-on tasks.

Overview
In this section, we first illustrate how to estimate a semantic segmentation map from an in-shop garment, which is used to guide the generation of human components (see Section 3.2). Secondly, we explain the general structure and sub-modules of SMIAN, which is used to generate components and synthesize the result (see Section 3.3). Thirdly, we explain the generation strategy for each component and specific implementation details (see Section 3.4). Finally, we describe the training loss functions of GMN and SMIAN (see Section 3.5). The whole framework is shown in Fig. 1.
Given a reference image $I \in \mathbb{R}^{3\times H\times W}$, an in-shop garment $c \in \mathbb{R}^{3\times H\times W}$, a reference pose map $p_t \in \mathbb{R}^{18\times H\times W}$, and a reference semantic map $s \in \mathbb{R}^{1\times H\times W}$, the task of virtual try-on is to transfer the in-shop garment $c$ to corresponding areas in the reference image $I$ to generate an output image $\hat{I} \in \mathbb{R}^{3\times H\times W}$, that is, the virtual try-on result. For this task, we propose a novel framework $T$:

Segmentation generation
Recently, image-to-image translation [36,37] has been widely employed to generate desired images due to its remarkable effectiveness. Inspired by this approach, we first need to generate a semantic segmentation map of a person to guide subsequent image generation.
Generating a segmentation aims to produce a target semantic map $s_t \in \mathbb{R}^{20\times H\times W}$ containing the shape of the in-shop garment $c$. Specifically, in the virtual try-on task, the semantic map remains unchanged except for the garment, neck, and arm areas. Therefore, we combine the garment, neck, and arms in the reference semantic map $s$ into one region in a reallocated semantic map $s_m$. This enables other parts of $s_m$ to be used as boundary conditions for warping: the combined area is the range of warping. In the target semantic map $s_t$, $s_m$ is reallocated according to the shape of the warped garment $\tilde{c}$.
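As an illustration only, the merging of garment, neck, and arm labels into a single warping region could be sketched as follows. The label indices and the choice to reuse the garment index for the merged region are assumptions made for this example; the actual indices depend on the parsing scheme used to produce $s$.

```python
# Hypothetical label ids; the real values depend on the human parsing scheme.
GARMENT, NECK, LEFT_ARM, RIGHT_ARM = 5, 10, 14, 15
MERGED = GARMENT  # assumption: the combined region reuses the garment id


def reallocate(semantic_map):
    """Return a copy of the label map with garment, neck and arms merged."""
    to_merge = {GARMENT, NECK, LEFT_ARM, RIGHT_ARM}
    return [[MERGED if label in to_merge else label for label in row]
            for row in semantic_map]


s = [
    [0, 2, 2, 0],     # background and hair stay as they are
    [0, 10, 10, 0],   # neck
    [14, 5, 5, 15],   # arms on either side of the garment
    [0, 9, 9, 0],     # pants stay as they are
]
s_m = reallocate(s)
```

In this toy map, every label outside the merged set is left untouched, which is what allows those regions to act as boundary conditions for the warp.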
However, the human pose is flexible, and there are many poses in which the arms and garment overlap. During training with these poses, it is difficult for the network to learn which part represents the arms in the semantic map, which causes the arms' semantic map to be occupied by the garment's semantic map. To enable the network to easily locate the arms in general, we create an anti-cover map $A_c \in \mathbb{R}^{1\times H\times W}$ (connecting the shoulder, elbow, and wrist with a one-pixel-wide short line) as input conditioning information to highlight the arms' positions for more accurate segmentation.
As shown in Fig. 1(a) (below), we adopt U-Net [38] as the generation network. $s_m$, $A_c$, and the coarsely warped in-shop garment $\tilde{c}$ (described later) are used as inputs, and the weighted cross-entropy loss $L_s$ [28] is used to optimize the network.
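A minimal sketch of how an anti-cover map of this kind might be rasterised is given below. It assumes that the shoulder, elbow, and wrist keypoints are available as integer (x, y) pixel coordinates inside the image, and it uses Bresenham's line algorithm to obtain one-pixel-wide segments; the function names are hypothetical and the text does not specify the drawing routine.

```python
def draw_line(canvas, p0, p1):
    """Rasterise a one-pixel-wide segment with Bresenham's algorithm."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = 1
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def anti_cover_map(height, width, arms):
    """arms: iterable of (shoulder, elbow, wrist) keypoints, each (x, y)."""
    canvas = [[0] * width for _ in range(height)]
    for shoulder, elbow, wrist in arms:
        draw_line(canvas, shoulder, elbow)
        draw_line(canvas, elbow, wrist)
    return canvas


# One arm: shoulder at (1, 0), elbow at (1, 3), wrist at (4, 3).
A_c = anti_cover_map(6, 6, [((1, 0), (1, 3), (4, 3))])
```

A practical version would also need to skip arms whose keypoints were not detected by the pose estimator, which this sketch does not handle.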

Basics
We propose a novel semantic map-based image adjustment network (SMIAN), which aggregates dual-modality features from the source image to reconstruct body components through the semantic map. As Fig. 2 shows, SMIAN consists of three modules: a parsing module, a content module, and an integration module. Unlike the generator in Ref. [39], which adds several style blocks between the encoder and decoder, the decoder of SMIAN consists of several AdaIN ResNet blocks (ARBs).

Parsing module and content module
The content and parsing modules have the same structure, and both consist of five down-sampled convolutional layers. However, their roles are different. The parsing module is used to obtain the feature map of the varying semantic map of components for input to the integration module. The content module is used to obtain content information (e.g., colour, texture, embroidery) of the components and spatial information from the semantic map. It provides the basis for the ARB in the integration module.

Integration module
The integration module analyses the feature map extracted by the parsing module to reconstruct the component. It consists of seven ARBs and five up-sampled convolutional layers. StyleGAN [19] successfully applied adaptive instance normalization (AdaIN) to the progressive generative model, which distributes the features to the latent variables. Inspired by this, we introduce several ARBs to finely reconstruct body components, using AdaIN to restructure the spatial distribution information of the semantic map and the content information of the component image. The AdaIN calculation is
$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$
where $x$ is the feature map produced by the previous ARB or the previous convolution layer, and $y$ is the feature map produced by the multi-layer perceptron (MLP). $\mu$ and $\sigma$ denote mean and standard deviation, respectively. The formula adjusts the mean and standard deviation of the semantic map to the component image.
We define the input feature maps of the ARB as $F_{r-2}$ and $F_m$ (from the MLP). Note that the input signals from the content module are all the same in the ARB. As shown in Fig. 2, feature processing operations in the overall ARB can be defined as
where $r = 3, \ldots, R$, and $R$ is the number of execution steps. $\varphi_{r-1}$ denotes the convolution operation in step $r-1$, and $\sigma_{r-1}$ denotes the ReLU activation function in step $r-1$. Following the standard ResNet block [40], a skip connection structure was added to fuse the input and output feature maps using:
where $F_r$ is the output of the ARB.
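To make the AdaIN step concrete, the following sketch applies the standard AdaIN operation to a single channel stored as a flat list, and wraps it in a residual block. The ordering inside `arb` (AdaIN, then ReLU, then convolution, applied twice) and the identity placeholders used for the convolutions are assumptions made for illustration, not the exact ARB configuration.

```python
from statistics import fmean, pstdev


def adain(x, y, eps=1e-5):
    """Rescale channel x so that its mean and std match those of channel y."""
    mu_x, sigma_x = fmean(x), pstdev(x)
    mu_y, sigma_y = fmean(y), pstdev(y)
    return [sigma_y * (v - mu_x) / (sigma_x + eps) + mu_y for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def arb(f_in, f_m, conv1=lambda t: t, conv2=lambda t: t):
    """Assumed block layout: two AdaIN/ReLU/conv steps plus a skip connection.

    conv1 and conv2 are hypothetical stand-ins for learned convolutions.
    """
    h = conv1(relu(adain(f_in, f_m)))
    h = conv2(relu(adain(h, f_m)))
    return [a + b for a, b in zip(f_in, h)]  # ResNet-style fusion


x = [1.0, 2.0, 3.0, 4.0]      # stands in for parsing-branch features
y = [10.0, 20.0, 30.0, 40.0]  # stands in for MLP (content) features
out = adain(x, y)
```

After the call, `out` has approximately the mean and standard deviation of `y` while keeping the relative ordering of the values in `x`, which is the sense in which the statistics of one modality are transferred to the other.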

Body component generation
We now describe how we create the human's arms, garment, and neck as body components by SMIAN to minimize mutual interference in the resulting image distribution.

Garment component (GC)
As shown in Fig. 1(a) (above), the in-shop garment mask $M_c$ (see Fig. 3, additional shape information), the reference pose map $p_t$ (see Fig. 3), and the reallocated semantic map $s_m$ are used as the inputs to the GMN [13], which warps the in-shop garment to generate the coarsely warped garment $\tilde{c}$.
Because there are only a few controlled parameters θ in GMN that manipulate the warping of the garment, an error in one of the parameters θ can lead to over-warping (unnatural excessive partial distortion of the garment) or under-warping (garment not fully aligned with the garment semantic map).
To overcome over-warping, we introduce a sampling interval consistency loss $L_{sic}$ [3,15] into the GMN to limit the spacing between sampling points; it is given by
where $\hat{G}_x$ and $\hat{G}_y$ are the x and y coordinates of the sampled grid, respectively, and the absolute difference $|a - b|$ measures the distance between two adjacent nodes $a$ and $b$. $L_{sic}$ totals the distances of all points to adjacent points in the sampled grid.
As shown in the garment part of Fig. 1(c), to overcome under-warping, the predicted garment semantic map $s_t^c$ and the warped garment $\tilde{c}$ are processed by SMIAN to repair missing areas and remove redundant areas. In this way, the garment component $\hat{c}$ is obtained via
$$\hat{c} = \mathrm{SMIAN}\langle \mathrm{concat}(s_t^c, \tilde{c}),\ s_t^c \rangle$$
where concat(·) denotes channel-wise concatenation.
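One possible reading of the verbal description of $L_{sic}$ is sketched below: the x coordinates of horizontally adjacent grid nodes and the y coordinates of vertically adjacent grid nodes are compared, and the absolute differences are summed. This is an assumption based on the description above, with each adjacent pair counted once; the exact formulation may differ.

```python
def interval_loss(grid_x, grid_y):
    """Sum of |a - b| over adjacent nodes of a sampled grid (assumed form).

    grid_x[r][c] and grid_y[r][c] hold the x and y coordinates of node (r, c).
    """
    rows, cols = len(grid_x), len(grid_x[0])
    loss = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:   # horizontal neighbour, x coordinate
                loss += abs(grid_x[r][c + 1] - grid_x[r][c])
            if r + 1 < rows:   # vertical neighbour, y coordinate
                loss += abs(grid_y[r + 1][c] - grid_y[r][c])
    return loss


regular_x = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
regular_y = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
# Same grid extent, but the middle node of the second row overshoots and folds.
folded_x = [[0, 1, 2], [0, 3, 2], [0, 1, 2]]
```

For a grid of fixed extent, the sum is smallest when the coordinates increase monotonically, so a fold such as the one in `folded_x` raises the value; this is how a term of this kind can discourage excessive local distortion.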

Arms component (AC)
Different in-shop garments have different sleeve shapes. Therefore, the arms $I_u$ in the reference image cannot be directly reused as input. The simplest way to handle this is to (i) reuse the correct arms directly, (ii) repair missing areas in the arms using the generator, and (iii) remove unnecessary areas in the arms using the generator. However, as Fig. 3 shows, while the arms and in-shop garment correspond in the training set, in practice, they do not always correspond due to differing garment shapes. As shown in Fig. 4 and the arms part of Fig. 1(b), to overcome this problem, during training, we randomly crop the arms $I_u$ to provide input that enables the generator to learn an inpainting capability. Furthermore, we perform an AND operation between the randomly cropped $I_u$ and the arms semantic map $s_t^u$ to remove possible background.
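During testing, only the AND operation is performed between $s_t^u$ and the arms $I_u$ to remove unnecessary areas of the arms $I_u$. This maximally retains the original arms in $\bar{I}_u$, which can be expressed as
$$\bar{I}_u = \begin{cases} s_t^u \odot \mathrm{rand}(I_u), & \text{during training} \\ s_t^u \odot I_u, & \text{during testing} \end{cases}$$
where $\odot$ denotes an AND operation, and rand(·) denotes random cropping.
Finally, the arms component $\hat{I}_u$ is obtained by SMIAN. It can be formulated as
$$\hat{I}_u = \mathrm{SMIAN}\langle \mathrm{concat}(s_t^u, \bar{I}_u),\ s_t^u \rangle$$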
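The preparation of the arms input can be sketched as follows, using single-channel images stored as nested lists for brevity. Interpreting random cropping as blanking a random rectangle is an assumption made for this example, as are the function names and the maximum patch size.

```python
import random


def mask_and(image, mask):
    """Keep pixels where the binary mask is 1 and zero the rest."""
    return [[px if m else 0 for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def random_crop_out(image, rng, max_size=2):
    """Blank a random rectangle so that a generator would have to inpaint it."""
    h, w = len(image), len(image[0])
    ch, cw = rng.randint(1, max_size), rng.randint(1, max_size)
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    out = [row[:] for row in image]
    for r in range(top, top + ch):
        for c in range(left, left + cw):
            out[r][c] = 0
    return out


def prepare_arms(arms, arms_mask, training, rng=None):
    """Training: random crop, then AND. Testing: AND only."""
    rng = rng or random.Random(0)
    src = random_crop_out(arms, rng) if training else arms
    return mask_and(src, arms_mask)


arms = [[9] * 4 for _ in range(4)]
arms_mask = [[0, 1, 1, 0] for _ in range(4)]
test_input = prepare_arms(arms, arms_mask, training=False)
```

In the test-time branch, every arm pixel inside the target arms mask is passed through unchanged, which corresponds to retaining as much of the original arms as possible.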

Neck component (NC)
As Fig. 3 shows, the reference semantic map $s$ of the VITON dataset does not contain a neck semantic map, so the neck's shape is unchanged before and after try-on. Therefore, we add a neck semantic map to $s$ to remove limitations due to neck shape. The neck component is handled similarly to the arms component. As shown in Fig. 5 and the neck part of Fig. 1(b), during training, the neck $I_n$ in the reference image is randomly cropped to learn an inpainting capability. During testing, only the AND operation is performed between $s_t^n$ and the neck $I_n$ to remove unnecessary areas of the neck $I_n$. This maximally retains the original neck in $\bar{I}_n$, which can be expressed as
$$\bar{I}_n = \begin{cases} s_t^n \odot \mathrm{rand}(I_n), & \text{during training} \\ s_t^n \odot I_n, & \text{during testing} \end{cases}$$
where, as for the arms, $\odot$ denotes the AND operation and rand(·) denotes random cropping. In this way, the collar shape of the in-shop garment determines the neck shape in the result. Finally, the neck component $\hat{I}_n$ is obtained by SMIAN. It can be formulated as
$$\hat{I}_n = \mathrm{SMIAN}\langle \mathrm{concat}(s_t^n, \bar{I}_n),\ s_t^n \rangle \tag{11}$$

Component synthesizer (CS)
All the components ($\hat{c}$, $\hat{I}_n$, and $\hat{I}_u$) and unchanged parts $I_{inv}$ in the reference image $I$ are spliced to generate the try-on result $\hat{I}$. However, in the actual splicing operation, cracks occur between boundaries of the components due to edge errors in the semantic map. As Fig. 1(d) shows, we use SMIAN to repair the cracks with guidance from the target semantic map $s_t$. This allows a natural transition between components to be realised, and noise in the image is also reduced. This can be described as
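The naive splicing step, and the way cracks can arise from it, might be illustrated as below. The representation (nested lists with `None` marking pixels that a component does not cover) and both function names are assumptions made for this example; the repair itself is performed by SMIAN and is not modelled here.

```python
def splice(components, invariant):
    """Overlay component images on the invariant parts; None means 'no pixel'."""
    out = [row[:] for row in invariant]
    for comp in components:
        for r, row in enumerate(comp):
            for c, px in enumerate(row):
                if px is not None:
                    out[r][c] = px
    return out


def crack_mask(spliced):
    """Mark pixels that no component claimed (candidate cracks)."""
    return [[1 if px is None else 0 for px in row] for row in spliced]


invariant = [[1, None, None], [1, None, None]]   # e.g. head and pants
garment = [[None, 7, None], [None, 7, None]]
arms = [[None, None, None], [None, None, 3]]
spliced = splice([garment, arms], invariant)
cracks = crack_mask(spliced)
```

Here the top-right pixel is claimed by no component, so it shows up in `cracks`; gaps of this kind at component boundaries are what the synthesizer is meant to fill under the guidance of the target semantic map.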

Training
During training, the pixel-wise loss $L_1$, the perceptual loss $L_{per}$ [41], and $L_{sic}$ are used to optimize the GMN, which can be expressed as
$$L_{cw} = \lambda_1 L_1 + \lambda_{per} L_{per} + \lambda_{sic} L_{sic} \tag{13}$$
where the $\lambda$ are trade-off hyper-parameters for the corresponding loss functions.
Each SMIAN in the framework needs to be trained separately. The total loss $L_{SMIAN}$ consists of $L_1$, $L_{per}$, $L_{adv}$ [37], and $L_{fm}$ [37]. It can be expressed as
$$L_{SMIAN} = \lambda_1 L_1 + \lambda_{per} L_{per} + \lambda_{adv} L_{adv} + \lambda_{fm} L_{fm} \tag{14}$$
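As a small worked example, Eqs. (13) and (14) are plain weighted sums and can be evaluated as below. The weights are those reported in the implementation details; the individual loss values are made up purely to illustrate the arithmetic, and the helper name is hypothetical.

```python
def weighted_sum(terms, weights):
    """Combine named loss terms with their trade-off weights."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Trade-off weights as reported in the implementation details.
W_CW = {"l1": 1.0, "per": 1.0, "sic": 40.0}
W_SMIAN = {"l1": 1.0, "per": 10.0, "adv": 1.0, "fm": 10.0}

# Illustrative loss values only; these are not measurements.
example_terms = {"l1": 0.5, "per": 0.1, "adv": 0.2, "fm": 0.05}
l_smian = weighted_sum(example_terms, W_SMIAN)  # 0.5 + 1.0 + 0.2 + 0.5
```

With these weights the perceptual and feature-matching terms are scaled by a factor of ten relative to the pixel-wise and adversarial terms.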

Experiments and analysis
In this section, we first introduce the experimental dataset, VITON [2], and describe the implementation details of the experiment. We then verify the execution performance of SMIAN, and qualitatively and quantitatively compare our results with those of other state-of-the-art networks. We also conduct ablation experiments to demonstrate the effectiveness of each submodule in the framework. Finally, we conduct a user study to demonstrate the practicality of the proposed method.

Dataset
We use the VITON [2] dataset in our experiments. It contains 16,253 image groups. Each group consists of a front-view female image $I$, an in-shop garment image $c$ and its mask $M_c$, a reference semantic map $s$, and a reference pose $p_t$. The size of each image is 256 pixels × 192 pixels. The dataset contains 14,221 groups in the training set and 2,032 groups in the test set. Figure 3 shows a sample from the training set and another from the test set in VITON. It can be seen that the in-shop garment and the garment worn by the person are the same in the training set, while in the test set they are different.

Implementation details
Our experiments were carried out on two Tesla V100 GPUs with 32 GB RAM. By default, the learning rates for the generator and the discriminator were 0.0001, reduced linearly to 0 over half of the epochs, with a batch size of 4. The experiment adopted the Adam optimizer [42], with parameters set to $\beta_1 = 0.5$ and $\beta_2 = 0.999$. In $L_{cw}$, $\lambda_1 = \lambda_{per} = 1$ and $\lambda_{sic} = 40$; in $L_{SMIAN}$, $\lambda_1 = 1$, $\lambda_{per} = 10$, $\lambda_{adv} = 1$, and $\lambda_{fm} = 10$.
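The learning rate schedule could be sketched as below. The text does not say which half of training carries the decay; this example assumes the rate is held constant for the first half of the epochs and then decays linearly to zero over the second half, a convention used in pix2pixHD-style training. The function name and the epoch count in the example are hypothetical.

```python
def learning_rate(epoch, total_epochs, base_lr=1e-4):
    """Constant for the first half, then linear decay to 0 (assumed schedule)."""
    decay_start = total_epochs // 2
    if epoch < decay_start:
        return base_lr
    remaining = total_epochs - decay_start
    return base_lr * max(0.0, 1.0 - (epoch - decay_start) / remaining)


# Example with a hypothetical 20-epoch run.
schedule = [learning_rate(e, 20) for e in range(21)]
```

Under this assumption the rate stays at 0.0001 up to the midpoint, is halved three quarters of the way through, and reaches zero at the final epoch.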
The discriminator in SMIAN is the one from pix2pixHD [37], as shown in Fig. 1.

Performance
To assess the execution performance of SMIAN, we compared its convergence rate during training and its test results with those of the U-Net (used in CP-VTON, CP-VTON+, and ACGPN), using $L_1$, $L_{per}$, $L_{adv}$, $L_{fm}$, and $L_{SMIAN}$ (vertical axis) as indirect measures of performance.
In Fig. 6, to show the effect clearly, we test the result once every four images (horizontal axis). During training, the total loss $L_{SMIAN}$ of the proposed method settles at about 2.0, while that for U-Net remains at about 3.5. The convergence rate of the other sub-losses is also better than for U-Net. During testing, the proposed method also has lower total loss: $L_{SMIAN}$ for each image from the proposed method stands at about 4.65, and for U-Net, about 6.75. The other sub-losses also differ markedly between the two networks in each batch of images. The lower pixel-wise $L_1$ loss indicates that the images generated by SMIAN differ only slightly from the ground truth, and the lower $L_{per}$, $L_{adv}$, and $L_{fm}$ indicate that SMIAN generates more realistic images.
The above results demonstrate that our method has significantly improved network performance and better generated image quality.

Qualitative results
We next provide qualitative results of the proposed method and compare visual outputs with those from three state-of-the-art models, CP-VTON [4], CP-VTON+ [15], and ACGPN [16].

Semantic map correctness
Figure 7 (above) shows the effect after adding the neck semantic map. Column 5 (for CP-VTON) is the effect without the neck semantic map, where the reference image limits collar type. The neck semantic map in column 6 (for ACGPN) is in error. The last column is the effect of the proposed method, where the dependence of the collar type on the garment worn by the person is eliminated. The collar type, which changes to the shape of the in-shop garment, has a more natural appearance in the result.
Figure 7 (below) shows the effect after adding the anti-cover map $A_c$ (column 3). Column 5 (for CP-VTON) is the effect without the anti-cover map; the garment covers the arms. The arms semantic map in column 6 (for ACGPN) is in error. In the last column, the arms are generated clearly, showing that the anti-cover map works as intended.
Figure 8 compares our method with state-of-the-art methods in terms of garment alignment, using a grid image (column 2) as a visual representation of the degree of warping. CP-VTON (column 4) exhibits over-warping because it does not impose constraints. In addition, it uses the rough shape as condition information, causing the absence of warp boundaries. Although CP-VTON+ (column 6) incorporates constraints, its input contains a rough shape, and the network lacks perceptual loss as an optimisation function, leading to local over-warping. ACGPN (column 8) excessively limits the spacing by second-order-difference constraints, which results in an almost uniform degree of warping throughout the garment. It enhances the sleeve area in the garment, which as a result is extremely unnatural. We also show the result without $L_{sic}$ in column 10, where the garment shows over-warping. In contrast, our proposed method (column 12) shows a more natural warping effect without over-warping by using $L_{sic}$, $s_m$, and perceptual loss.

Comparison of try-on results
As Fig. 9 shows, CP-VTON, CP-VTON+, and ACGPN have defective garment alignment which affects the final result. The way the garment and body are integrated in CP-VTON leads to problems such as texture confusion. Although CP-VTON+ works well on restoration of collar type, the lack of a strategy for retaining original details causes problems such as occlusion and loss of detail. There are many errors in the ACGPN semantic map, which produce unsatisfactory results.
Our method uses interval consistency loss and perceptual loss to overcome over-warping, making the results more realistic. Under-warping and mismatches of sleeve shape and collar type between the result and in-shop garment are avoided by semantic adaptive componentization. Finally, our method preserves the most original content in the reference image and in-shop garment. Compared to CP-VTON, CP-VTON+, and ACGPN, our method works better.

Coupling study
The application goal of the virtual try-on task is to try on the desired garment online, so we chose four different garments for a coupling study. As shown in Fig. 10, the proposed method works well.

Quantitative results
We further analyzed performance using benchmark metrics for image quality, adopting structural similarity (SSIM) [43] defined in Eq. (15) and Fréchet inception distance (FID) [44] defined in Eq. (16) to measure the similarity between the try-on result and ground truth. The inception score (IS) [45] defined in Eq. (17) and peak signal to noise ratio (PSNR) defined in Eq. (18) are adopted to measure the quality of the generated images.
$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)} \tag{15}$$
where $\mu_X$ is the mean of $X$, $\mu_Y$ is the mean of $Y$, $\sigma_X^2$ is the variance of $X$, $\sigma_Y^2$ is the variance of $Y$, $\sigma_{XY}$ is the covariance of $X$ and $Y$, and $C_1$ and $C_2$ are two constants to ensure stability.
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left[\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right] \tag{16}$$
where $\mathrm{Tr}$ denotes matrix trace, $\mu$ is the mean, and $\Sigma$ is covariance; the subscripts $r$ and $g$ refer to the ground truth and generated images, respectively.
$$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\, D_{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right) \tag{17}$$
where $p(y \mid x)$ is the class distribution predicted for the generated data $x$, $p(y)$ is the marginal distribution of the predicted classes, and $D_{KL}$ denotes relative entropy.
$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\varepsilon} \tag{18}$$
where $\varepsilon$ denotes mean square error (MSE) between the ground truth and generated image.
Table 2 summarizes the performance of state-of-the-art methods. Our method improves IS from 2.85 (for DCTON [47]) to 2.86, which reflects the fact that the distribution of the generated image is closer to the ground truth distribution. The original detail retention strategy for the arms and garment components improves SSIM from 0.83 (for DCTON) to 0.87. Also, PSNR increased by around 10%, from 23.067 (for ACGPN) to 25.423. The component synthesizer contributes to defect-free fusion between components, reducing noise and improving FID from 14.82 (for DCTON) to 12.63. The experimental data show that our method provides convincing virtual try-on results.
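For illustration, PSNR as in Eq. (18) and a simplified SSIM can be computed on flattened 8-bit images as follows. The SSIM here is evaluated once over the whole image with the commonly used constants $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$; both the single global window and these constants are assumptions of this sketch, since SSIM is normally averaged over local windows.

```python
import math
from statistics import fmean


def psnr(x, y, peak=255.0):
    """10 * log10(peak^2 / MSE); infinite when the images are identical."""
    mse = fmean((a - b) ** 2 for a, b in zip(x, y))
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def ssim_global(x, y, peak=255.0, k1=0.01, k2=0.03):
    """SSIM from global statistics (no sliding window), as a simplification."""
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


truth = [0, 64, 128, 200]
generated = [10, 74, 138, 210]   # every pixel off by 10, so MSE = 100
score = psnr(truth, generated)
```

A uniform brightness offset such as the one in this toy example lowers PSNR while leaving the structure term of SSIM almost unchanged, which is one reason the two metrics are reported together.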

Discussion and ablation study
We performed an ablation study to verify the utility of each part of our framework. We observe from the ablation experiment results in Fig. 11 that: (i) the ARB preserves rich component details by correlating the spatial distribution between component and semantic map, (ii) the AC preserves arm details from the reference image very well, with fingers and arms distinguished, (iii) the GC effectively prevents any mismatch between arm length and sleeve length of the in-shop garment, and details of the garment are well preserved, and (iv) the CS fuses all components through semantic map guidance, gaps between components are repaired, and noise is reduced, making the results more natural and realistic. Table 3 provides the same metrics as before for these four cases; the data in the table once again confirm the points above.

User study
To further evaluate the effectiveness of our approach, we designed a user study using a questionnaire. First, the results obtained from three virtual try-on methods on the test set were mixed. Then, we invited two volunteers from the fields of fashion design and computer vision to score all test images (in the range 0 to 1). Finally, we calculated the average score as the satisfaction for each image and plotted the scores as a statistical graph. The results in Fig. 12 indicate that the images obtained by the proposed method provide a better sensory experience than those of CP-VTON and ACGPN, suggesting that our method is better suited to virtual try-on.

Conclusions
In this work, we have proposed a novel virtual try-on framework. To preserve details of the human body and garment, we proposed SMIAN, which accelerates network convergence and improves generation quality. It improves the performance of the virtual try-on framework.
Moreover, the body parts to be synthesized are componentized for local-to-global generation, solving existing problems such as occlusion and loss of detail. The componentization of the body area also reduces coupling in the result, which helps the network pay more attention to local details. Compared to state-of-the-art methods, our pipeline provides quantitatively better results and visual effects. User satisfaction is increased to 83.5%. In future, we plan to expand our framework to deal with image-based pose transfer with complex appearance-aware information.