1 Introduction

In the last few years, and especially during the COVID-19 pandemic, online shopping for clothes has become common practice for millions of people around the world and a habitual activity for many consumers. This growth has been driven in part by virtual try-on technology, which lets customers visualize products on themselves and see how they look before purchasing. In 2012, Converse was the first brand to offer a virtual try-on on the iPhone, allowing customers to use their phone cameras to see how shoes looked on them, post photos on social media, and make online purchases [92]. This technology applies well to shoes, apparel, accessories, jewelry, and make-up, where consumers long for a sense of “touch and feel”. It gives them total freedom to try, compare, and choose products at their own pace, without feeling pressure to make a purchase.

Approximately 40% of customers are willing to spend more if they can try a product through virtual reality [92], because the try-on experience makes it much easier to explore other options and to customize or personalize products according to their body shape. For this reason, online shopping for clothes has earned its place. Popular brands including L’Oréal, Baume, Sephora, Adidas, Nike, and Snap are adopting try-on technology to improve their connection with customers and gain a competitive advantage in the market. The global fashion apparel market has exceeded 3 trillion US dollars in the current year and represents 2% of the world’s Gross Domestic Product (GDP). In 2020, the fashion sector attained a revenue of 718 billion US dollars, and growth of more than 8.4% is expected in the coming years [73].

During the COVID-19 lockdowns, most businesses went into a kind of crisis mode, and not only big brands but also small retailers had to think about how to survive [81]. Taking our time in shops will be difficult in a post-COVID-19 world. As a result, online shopping has become ingrained in daily life, and it increasingly resembles shopping in person thanks to the efforts of businesses to add features and services that give customers the same support and comfort they would have in a store. This goal is pursued by using computer technology to develop virtual try-on applications that help assess garment fit, so that consumers know how clothes look on them, how the top and bottom match together, and how well the size fits.

Online shopping therefore offers more information and a wider availability of products, which encourages fashion retailers to invest in new sales methods and to optimize the technological process of purchasing clothes, for example with virtual fitting systems. These solutions reshape the online shopping experience and bring it to a higher level of realism and comfort. One such improvement is to let consumers buy clothes after trying them on, as in real shops, because existing systems do not let users try on various fashion items as they wish. Fashion brands thus need to better satisfy customer preferences and offer a personalized shopping experience that supports more informed and confident purchase decisions. In addition, allowing consumers to virtually try on clothes will not only enhance their shopping experience but also increase the fashion industry’s sales, because these solutions can help reduce return rates and improve customer satisfaction.

Current graphics tools fail to meet the increasing demand for personalized visual content manipulation, so many algorithms have been proposed to address clothes swapping using recent advances in computer vision tasks such as fashion detection, fashion analysis, and fashion synthesis. These solutions require considerable effort from researchers to change clothes while preserving details and identity. Image editing software such as Adobe Photoshop or Adobe Illustrator cannot give realistic results because of the many challenges of changing clothing in 2D images, such as clothing deformation, different poses, and different textures. Recent studies have adopted deep-learning-based methods to address these problems and achieve more accurate results.

In the literature, only a small number of fashion surveys have been proposed [6, 42, 53, 71]. Recently, Liu et al. [42] summarized work on intelligent clothing analysis. Song and Mei [71] presented an overview of fashion development with the emergence of multimedia. A general survey [6] draws the whole picture of intelligent fashion without focusing on a specific issue. Another survey [53] presents AI applications in the fashion apparel industry, but it covers only structured task-based multi-label classification works. Owing to the rapid development of computer vision, many new tasks have appeared within intelligent fashion, so the related literature needs to be updated. In this direction, this survey conducts a comprehensive literature review of deep learning methods applied in the fashion industry, citing research published in recent years and relating it to earlier studies. Our contribution consists in answering the following research questions:

  • RQ1. What is the impact of adopting Artificial Intelligence (AI) in the garment industry?

  • RQ2. How are virtual try-on systems developed?

  • RQ3. What are the common problems that need to be solved to enable intelligent fashion shopping?

The rest of this paper is structured as follows. Section 2 outlines the research framework adopted for this review. Section 3 is dedicated to virtual try-on applications and is divided into two parts: the first presents the fashion detection tasks, including fashion parsing, human pose estimation, and landmark detection; the second covers fashion synthesis, including style transfer, pose transformation, and clothing simulation. Section 4 provides an overview of fashion benchmark datasets. Section 5 presents the performance of popular works on different tasks. Section 6 discusses related applications and future directions. Finally, Section 7 concludes the paper.

2 Research framework

In this study, a Systematic Literature Review (SLR) [29] is used to focus on research related to virtual fitting systems that are based on 2D images, use deep learning methods, and are applied in the fashion industry. The SLR methodology adopted is shown in Fig. 1. The review process began with collecting and preparing data from scientific databases. Articles were then selected in several phases according to our research framework, yielding more than 100 articles from both journals and conferences.

Fig. 1

Article Classification based on Research Questions

Articles for each task of the topic at hand, such as fashion detection [10, 13, 14, 28, 30,31,32, 34, 35, 37,38,39,40,41, 43, 44, 52, 55,56,57, 64, 76,77,78,79, 83, 85, 93,94,95, 102, 103] and fashion synthesis [3, 7, 9, 12, 15,16,17, 21, 22, 26, 27, 33, 48,49,50,51, 58, 59, 61, 62, 65,66,67, 69, 70, 72, 74, 84, 86, 89, 90, 96,97,98, 100, 101, 104, 106, 107], were retrieved from popular databases and search engines such as Google Scholar and ResearchGate. A screening process was then used to select the articles that address the research questions stated in the previous section. Next, the articles were categorized according to the main steps used to develop image-based virtual fitting systems with deep learning methods. After categorization, information was extracted from the selected articles and classified based on the key terms of the research topic to address our research questions.

As shown in Fig. 1, which presents the article classification according to the research questions, RQ1 focuses on understanding the overall trend of AI in the fashion industry. Hence, screening for RQ1 was limited to articles discussing the implementation and execution of AI techniques to improve online shopping. RQ2 aims to identify the stages of the virtual fitting framework in which AI methods are employed. RQ3 aims to understand the extent to which online shopping problems have been the focus of research studies. These key modules were considered during information extraction from the research articles.

3 Fashion virtual try-on

In recent years, advanced machine learning approaches have been successfully applied to various fashion-related problems. The topics of fashion research in the literature on image-based garment transfer are summarized in Fig. 2. One branch is fashion detection, which aims to label each pixel in the scene (i.e., fashion parsing, landmark detection, and pose estimation). It is complemented by fashion synthesis, which brings us a step closer to an intelligent fashion assistant.

Fig. 2

Classification of approaches for image-based virtual try-on systems

3.1 Fashion detection

Fashion detection is an essential task for virtual try-on. It consists of detecting human body parts to predict the region where clothing is synthesized. Applying it in virtual try-on systems involves three aspects: fashion parsing, human pose estimation, and fashion landmark detection.

3.1.1 Fashion parsing

Fashion parsing, in other words human parsing with clothing classes, is a specific form of semantic segmentation. The task is to generate pixel-level labels for an image, with classes based on clothing items and body parts such as hair, head, upper clothes, and pants. It is a very challenging problem because the number of garment types and the variation in configuration and appearance are enormous. Figure 3 presents an example of fashion parsing results generated by the work of Ji et al. [28].

Fig. 3

Examples of fashion parsing based on semantic segmentation [28]
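
To make the pixel-level labeling concrete, the following minimal sketch converts per-pixel class scores into a label map and scores it against a reference map with per-class intersection-over-union. It is not taken from any of the cited works: the class list, the nested-list data layout, and the function names are assumptions chosen for illustration.

```python
# Hypothetical label set for illustration; real parsing datasets define their own classes.
CLASSES = ["background", "hair", "upper_clothes", "pants"]


def parse_labels(scores):
    """Turn per-pixel class scores (H x W x C nested lists) into an H x W label map."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


def per_class_iou(pred, target, num_classes):
    """Intersection-over-union per class; None for classes absent from both maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]


# Toy 2 x 2 image with one score per class at every pixel.
scores = [
    [[0.9, 0.05, 0.03, 0.02], [0.1, 0.7, 0.1, 0.1]],
    [[0.1, 0.1, 0.6, 0.2], [0.1, 0.1, 0.2, 0.6]],
]
pred = parse_labels(scores)
target = [[0, 1], [2, 2]]
ious = per_class_iou(pred, target, len(CLASSES))
```

In a real system the scores would come from a segmentation network, and the mean of the per-class values is the usual summary metric.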

In the fashion domain, a large body of work has been devoted to human parsing [10, 39, 41, 93, 94]. Early on, Yamaguchi et al. [93] proposed a model that combines fashion parsing with human pose estimation. They then proposed a retrieval-based approach to clothes parsing [94] to resolve the constrained parsing problem. After that, Liu et al. [41] presented a weakly supervised approach to fashion parsing that labels images with color-category labels instead of pixel-level labels. The results of these works are far from perfect because pose estimation and clothing parsing do not share consistent targets. These hand-crafted methods are also restrictive because they must be designed carefully.

To deal with these issues, many methods based on Convolutional Neural Networks (CNNs) have been proposed, such as the deep human parsing work of Liang et al. [10], which uses active template regression for semantic labeling. To improve their human parsing results, the authors then designed a Contextualized CNN (Co-CNN) [39] that captures cross-layer context, global image-level context, and local super-pixel context. In parallel, they proposed deep human parsing with Active Template Regression (ATR) [39], which decomposes an image of a person into semantic fashion and body regions. In 2018, Liao et al. [40] built a Matching CNN (M-CNN) to address the issues of parametric and non-parametric CNN-based methods. In the same year, Gong et al. [13] introduced an important self-supervised method called Look Into Person (LIP), which avoids the need to label human joints during model training (Fig. 4). To improve on this work [13], the same authors proposed JPPNet [102], which handles both human parsing and human pose estimation.

Fig. 4

Annotation examples for LIP [13] with appearance variability and different views

Unlike the works above, which concentrate on single-person parsing, many other works [14, 64, 85, 103] address scenes with multiple persons. Zhao et al. [103] designed a deep Nested Adversarial Network (NAN) to understand humans in crowded scenes. Gong et al. [14] made the first attempt at a detection-free Part Grouping Network (PGN), which combines semantic part segmentation, assigning each pixel to a human part, with instance-aware edge detection, grouping semantic parts into distinct person instances. To handle single and multiple human parsing simultaneously, Ruan et al. [64] developed the Context Embedding with Edge Perceiving (CE2P) framework. More recently, hierarchical graphs have been used to improve parsing performance, as in the work of Wang et al. [85], which treats the human body as a hierarchy of multi-level semantic parts to capture human parsing information.

3.1.2 Human pose estimation

Deep learning has driven advances in many computer vision tasks, including Human Pose Estimation (HPE), which is applied in fields such as fashion fitting to recover the posture of the human body by localizing its joints. Many research efforts have been devoted to overcoming the challenges of HPE. In this section, we present recent HPE methods based on 2D images, classified into two groups: single-person pose estimation and multi-person pose estimation.

Single-person human pose estimation

Single-person HPE is the task of localizing human skeletal keypoints in image or video data. Figure 5 presents single-person HPE results obtained by DeepPose [79] trained on the Leeds Sports Pose (LSP) dataset [30]. Depending on how the HPE task is formulated, CNN-based methods fall into two families: regression methods and detection methods.

Fig. 5

Example of human pose estimation from DeepPose [79] on the LSP Dataset [30]

Regression-based methods produce joint coordinates by learning a mapping directly from the image [79]. The early deep network adopted by many researchers was AlexNet [31], owing to its simple architecture. Toshev et al. [79] applied this network to learn joint coordinates from full images, and Li et al. [35] employed it in a multi-task framework to predict joint coordinates from the full image. Detection-based methods, in contrast, treat body parts as detection targets using two main representations: image patches and heatmaps of joint locations. Methods in this category predict the approximate locations of body parts [32] or joints [52].
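
The difference between the two families shows up in the output format: a regression model emits coordinates directly, whereas a heatmap-based detection model emits one score map per joint that must still be decoded. The sketch below illustrates only that decoding step with a plain arg-max. The function name and the nested-list heatmap layout are assumptions, and published methods typically add sub-pixel refinement that is omitted here.

```python
def decode_heatmaps(heatmaps):
    """Return one (x, y, score) triple per joint from a list of H x W heatmaps."""
    joints = []
    for heatmap in heatmaps:
        best = (0, 0, heatmap[0][0])
        for y, row in enumerate(heatmap):
            for x, value in enumerate(row):
                # Keep the first location that holds the highest score.
                if value > best[2]:
                    best = (x, y, value)
        joints.append(best)
    return joints


# Two toy 2 x 3 heatmaps, one per joint.
toy_heatmaps = [
    [[0.0, 0.1, 0.0], [0.0, 0.9, 0.2]],
    [[0.8, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
joints = decode_heatmaps(toy_heatmaps)
```

The score returned with each joint can serve as a confidence value, for example to discard occluded joints.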

Earlier works attempt to fit detected body parts to body models, whereas more recent works [57, 76,77,78] aim to encode human body structure information into the network. Tang et al. [77] proposed a hierarchical representation of body parts and then extended their work [76] to learn features specific to each part group. They later improved the network structure by proposing densely connected U-Nets with efficient memory usage [78]. Peng et al. [57] exploited data augmentation to avoid the need for more training data.

Multi-person human pose estimation

The second category is multi-person HPE, which must handle both detection and localization. It can be divided into top-down and bottom-up methods. Top-down methods first detect each person in the image with a bounding box and then apply a single-person pose estimator to predict each pose. Bottom-up methods first predict the 2D joints of all persons in the image and then assemble them into individual skeletons. Figure 6 shows example results from the work of Li et al. [38], which belongs to the bottom-up methods.

Fig. 6

Example of multi-person HPE [38]

Top-down HPE methods [55, 56] are implemented by combining existing detection networks with single-person HPE networks. They achieve state-of-the-art performance on almost all benchmark datasets, but their processing time depends on the number of detected people. Bottom-up HPE methods have two main components, body joint detection and joint candidate grouping, which most algorithms handle separately. Bottom-up methods perform very well except under conditions such as human occlusion or complex backgrounds.
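
The dependence of top-down methods on the number of detected people follows from their control flow, which can be sketched as below. Both `detect_people` and `estimate_single_pose` are hypothetical placeholders for a person detector and a single-person estimator, and the stubs used in the example are stand-ins, not real models.

```python
def top_down_pose(image, detect_people, estimate_single_pose):
    """Run a single-person estimator once per detected box.

    image is an H x W nested list; detect_people returns (x0, y0, x1, y1) boxes;
    estimate_single_pose returns (x, y) joints in crop coordinates.
    """
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):
        crop = [row[x0:x1] for row in image[y0:y1]]
        # Shift the joints from crop coordinates back into image coordinates.
        poses.append([(x + x0, y + y0) for (x, y) in estimate_single_pose(crop)])
    return poses


# Stand-in components: two fixed boxes and an "estimator" that returns the crop centre.
def fake_detector(image):
    return [(1, 1, 3, 3), (3, 0, 6, 4)]


def fake_estimator(crop):
    return [(len(crop[0]) // 2, len(crop) // 2)]


image = [[0] * 6 for _ in range(4)]
poses = top_down_pose(image, fake_detector, fake_estimator)
```

The estimator is called once per box, so the cost grows linearly with the number of people, whereas a bottom-up method detects all joints in one pass and pays instead for the grouping step.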

3.1.3 Fashion landmarks detection

Fashion landmark detection is an important task in fashion analysis. It aims to predict clothing keypoints, which are essential for understanding fashion images because they provide a discriminative representation. The local regions around fashion landmarks vary more than those around human body joints because clothes are more complex. Figure 7 shows results generated by a fashion landmark detection approach.

Fig. 7

Example of results from the fashion landmark detection approach of [37]. The first row shows results on DeepFashion-C [43]; the second row shows results on the Fashion Landmark Dataset (FLD) [44]

Liu et al. [43] first introduced the concept of fashion landmarks and, in parallel, proposed a deep model called FashionNet [43], which relies on predicted clothing landmarks. They then proposed a CNN-based deep fashion alignment framework [44]. This framework is trained on different datasets and evaluated on two fashion applications: clothing attribute prediction and clothes retrieval. Yan et al. [95] proposed another regression model that relaxes the clothing bounding-box constraint, which is difficult to apply in practice. A more recent work [83] noted that regression models are hard to optimize and instead directly predicts a confidence map of positional distributions for each landmark. Lee et al. [34] used contextual knowledge to achieve strong landmark prediction performance.
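
A confidence map replaces a single coordinate target with a dense map that peaks at the landmark position. The sketch below builds such a target with an isotropic Gaussian, which is a common choice in keypoint detection. The exact form used in [83] is not given here, so the Gaussian shape, the `sigma` parameter, and the function name are assumptions.

```python
import math


def gaussian_confidence_map(width, height, cx, cy, sigma=1.0):
    """Build an H x W map whose value decays with distance from the landmark (cx, cy)."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Target map for a hypothetical landmark at column 2, row 1 of a 5 x 4 grid.
target_map = gaussian_confidence_map(5, 4, cx=2, cy=1)
```

A network trained against such maps outputs one map per landmark, and the predicted position is read off as the location of the maximum.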

3.2 Fashion synthesis

Fashion synthesis is the task of generating a new style across images: it synthesizes a realistic-looking image that shows what a person would look like in a different clothing style. In the following, we review existing methods for generating images of clothed people, focusing on style transfer, pose transformation, and physical simulation.

3.2.1 Style transfer

In fashion synthesis, style transfer is an important step that transfers style between images. It can be applied to various kinds of images, especially facial images and garment images. CNN-based methods for this task use feature extraction to obtain style information from the image. Isola et al. [26] proposed pix2pix, a general solution for style transfer. For a more specific goal, the work of Xian et al. [90] uses a texture patch to transfer an input image or sketch to the corresponding texture (Fig. 8).

Fig. 8

Examples of image style transfer by TextureGAN [90]

Driven by the increasing power of deep generative models, popular virtual try-on applications have appeared [12, 16, 27, 50, 62, 84, 98]. Han et al. [16] proposed a two-stage pipeline called VIrtual Try-On Network (VITON) to transfer a desired in-shop clothing item onto a consumer’s body: the first stage warps the input item to the desired deformation, and the second stage aligns the warped clothes with the consumer’s image. Many approaches following this pipeline have been proposed with more competitive performance, such as CP-VTON [84] and CP-VTON+ [50], which adopt a learnable thin-plate spline (TPS) transformation [9] based on a convolutional neural network architecture for geometric matching to explicitly align the input clothing with the body shape. All these works rely on TPS, so Fig. 9 presents its application in the VITON architecture [16].

Fig. 9

Example of warping a clothing image as proposed by VITON [16]: given the target clothing image and a clothing mask, shape context matching is used to estimate the TPS transformation and generate a warped clothing image
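
Since the TPS transformation is central to these pipelines, a minimal pure-Python sketch of fitting and applying one may help. It assumes the control-point correspondences are already known, whereas in VITON they come from shape context matching and in CP-VTON they are regressed by a network. The function names and the small linear solver are illustrative and are not the cited implementations.

```python
import math


def _U(r2):
    """TPS radial basis U(r) = r^2 log r, written in terms of the squared distance."""
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)


def _solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular TPS system (e.g. collinear control points)")
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_tps(src, dst):
    """Fit a TPS that maps control points src to dst; returns (weights, affine)."""
    n = len(src)
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    B = [[0.0, 0.0] for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            A[i][j] = _U((xi - xj) ** 2 + (yi - yj) ** 2)
        A[i][n:] = [1.0, xi, yi]                        # affine part, block P
        A[n][i], A[n + 1][i], A[n + 2][i] = 1.0, xi, yi  # side conditions, block P^T
        B[i] = [float(dst[i][0]), float(dst[i][1])]
    sol = _solve(A, B)
    return sol[:n], sol[n:]


def warp_point(p, src, weights, affine):
    """Apply the fitted TPS to a single 2D point."""
    x, y = p
    out = []
    for d in range(2):
        v = affine[0][d] + affine[1][d] * x + affine[2][d] * y
        for (cx, cy), w in zip(src, weights):
            v += w[d] * _U((x - cx) ** 2 + (y - cy) ** 2)
        out.append(v)
    return (out[0], out[1])


# Hypothetical control points: the corners stay fixed, the centre point is displaced.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
dst = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.6, 0.4)]
weights, affine = fit_tps(src, dst)
```

Warping a clothing image then amounts to applying such a mapping to every pixel coordinate and resampling. The fitted spline passes exactly through the control points and bends smoothly in between, which is why errors in the correspondences translate directly into distorted garments.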

However, the results of these methods are limited in various cases (Fig. 10). One main cause of failure is the warping stage. In VITON [16], the TPS transformation is computed from the clothing mask and the warped target clothing image, which can be inaccurate, so its dependence on shape context prevents it from performing the warping task reliably. The geometric matching module adopted in CP-VTON [84] instead uses grid points as control points for calculating the TPS transformation to reduce distortions in the warped images, as can be seen in Fig. 10.

Fig. 10

Results from CP-VTON [84], CP-VTON+ [50], ACGPN [98], and CIT [62]

Subsequently, the Adaptive Content Generating and Preserving Network (ACGPN) [98] introduced a second-order difference constraint on the thin-plate spline (TPS) to produce geometrically matched yet character-retentive clothing images. This method also includes an additional semantic generation module that generates a semantic alignment of the spatial layout. It produces strong results but does not consider the latent global long-range correlation between the person representation and the in-shop clothing. Despite the good results of these methods [16, 50, 84, 98], there is still a need for more realistic images without artifacts, especially when there are occlusions or large variations. For this reason, a two-stage transformer pipeline called Cloth Interactive Transformer (CIT) [62] was proposed to model the latent global relation in both stages (Fig. 10).
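
The intuition behind a second-order difference constraint is that neighboring control points of the warping grid should stay evenly spaced, so that the garment is not stretched or squeezed locally. The sketch below is a simplified illustration of that idea and not necessarily the exact loss of [98]: for every interior control point it penalizes the difference between the distances to its two horizontal neighbors and to its two vertical neighbors. The function name and grid layout are assumptions.

```python
import math


def second_order_penalty(grid):
    """Sum of spacing imbalances over interior control points.

    grid[r][c] holds the (x, y) position of the warped control point in row r, column c.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    total = 0.0
    rows, cols = len(grid), len(grid[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            p = grid[r][c]
            # Horizontal neighbours should be equally far from p ...
            total += abs(dist(p, grid[r][c - 1]) - dist(p, grid[r][c + 1]))
            # ... and so should the vertical neighbours.
            total += abs(dist(p, grid[r - 1][c]) - dist(p, grid[r + 1][c]))
    return total


# A regular 3 x 3 grid has zero penalty; moving the centre point sideways does not.
regular = [[(float(c), float(r)) for c in range(3)] for r in range(3)]
shifted = [list(row) for row in regular]
shifted[1][1] = (1.5, 1.0)
penalty_regular = second_order_penalty(regular)
penalty_shifted = second_order_penalty(shifted)
```

Added to the warping loss with a small weight, a term of this kind discourages the irregular grids that produce visibly distorted textures.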

More recently, other works based on in-shop clothing items [12, 27] have addressed the same problem. They differ in that most of the above methods [16, 50, 62, 84, 98] rely on human segmentation of different body parts to enable the learning procedure of virtual try-on. However, high-quality human parsing requires substantial training of the corresponding models, and poor segmentation leads to highly unrealistic generated images. To reduce this dependence on masks as model inputs, the Warping U-Net for Virtual Try-On (WUTON) [27] appeared as the first parser-free network, performing virtual try-on without human segmentation, as shown in Fig. 11. Then, the Parser-Free Appearance Flow Network (PF-AFN) [12] was proposed in the same context to produce highly photo-realistic try-on images without human parsing (Fig. 11).

Fig. 11

Different architectures for the warping module: (a) based on a segmentation mask, from VITON [16]; (b) without human segmentation, from WUTON [27] and PF-AFN [12]

The previous works require an in-shop clothing image for virtual try-on, but other models such as FashionGAN [7] and M2E-TON [89] address the task from a text description and a model image: the input is an image and a sentence describing a different outfit. First, a GAN generates the segmentation map according to the description; then another GAN renders the output image from the segmentation map. Other works attempt to handle arbitrary poses, such as Fit-Me [21], the first virtual try-on work to deal with this challenge. FashionOn [22] then applied semantic segmentation to produce more realistic results. SwapNet [61] was one of the first works to tackle the challenge of transferring all the clothing from one person’s image onto the pose of another target person by operating in image space. This is done by generating a mutually exclusive segmentation mask of the desired clothing in the desired pose.

Another virtual try-on network, VTNFP [100], uses a similar strategy to synthesize photo-realistic images given an image of a clothed person and a target clothing item. Zheng et al. [106] presented an architecture for trying on clothing in arbitrary poses that uses body shape mask prediction for pose transformation. Following the same design strategy, Han et al. [17] proposed ClothFlow, an appearance-flow-based generative model that transfers different appearances and synthesizes clothed persons for pose-guided person image generation and virtual try-on.

Recently, various works [48, 51, 66, 67, 74, 96] have addressed the challenging problem of garment interchange between pictures of people while preserving identity in the source and target images, by developing image-based virtual try-on networks. Feng et al. [74] address the loss of visual details and missing body parts by maintaining structural consistency between the generated image and the reference image. Outfit-VITON [74] allows the visualization of a cohesive outfit from multiple images of clothed human models while fitting the outfit to the body shape and pose of the query person. Sarkar et al. [66, 67] achieve high-quality try-on results by aligning the given human images with a 3D mesh model via DensePose [79], estimating a UV texture map corresponding to the desired garments, and rendering this texture onto the desired pose (Fig. 12).

Fig. 12

Garment transfer results generated by the work of Sarkar et al. [67]

In the current year, Dressing in Order (DiOr) [67] adopted a conditioning model to support 2D pose transfer, virtual try-on, and several fashion editing tasks, and the Complementary Transferring Network (CT-Net) [96] was published to adaptively model different levels of geometric change and transfer outfits between different people. Despite the diversity of these systems, preserving details and correctly rendering shape and texture remain challenging.

3.2.2 Pose transformation

Pose transformation is a crucial task for fashion synthesis. It takes an input image of a person and a target pose and generates images of that person in different poses while preserving the original identity (Fig. 13). Many works have been proposed for this task. First, a pose-guided person image generation network, PG2 [48], used a two-stage adversarial network in an early attempt at the challenging task of transferring a person to different poses, generating both pose and appearance simultaneously and using an affine transform to keep textures in the generated results.

Fig. 13

Examples of pose transformation results generated by PG2, the work of Liqian Ma et al. [48], on the DeepFashion dataset [43] (a) and the Market-1501 dataset [104] (b)

Siarohin et al. [70] used a deformable GAN to generate images of a person according to a target pose, extracting the articulated object pose with a keypoint detector. Guha et al. [3] addressed human pose synthesis with a modular generative neural network that synthesizes unseen poses using four modules: image segmentation, spatial transformation, foreground synthesis, and background synthesis. Si et al. [69] introduced a multi-stage pose-guided image synthesis framework that divides the network into three stages: pose transformation in a novel 2D view, foreground synthesis, and background synthesis. Pumarola et al. [59] addressed the data limitations of the above studies by borrowing the idea from [107] and leveraging cycle consistency.

Last year, Song et al. [72] proposed a solution to this limitation: a novel approach that decomposes the hard mapping into two sub-tasks, semantic parsing transformation and appearance generation, to improve appearance quality. In addition, the generative model Attribute-Decomposed GAN (ADGAN) [49] produces realistic images with desired human attributes. The idea behind this work is to embed human attributes into the latent space as independent codes and then control the attributes via mixing and interpolation operations on explicit style representations.
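
The mixing and interpolation operations can be illustrated independently of any network. In the sketch below each attribute is a small vector, a style is a dictionary of such vectors, and the two operations either swap whole attribute codes or blend them linearly. The attribute names, code lengths, and function names are hypothetical, and the generator that would decode these codes into an image is omitted.

```python
def interpolate_codes(code_a, code_b, t):
    """Linear blend of two attribute codes: t = 0 gives code_a, t = 1 gives code_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)]


def mix_styles(style_a, style_b, take_from_b):
    """Copy style_a, replacing the attributes named in take_from_b with those of style_b."""
    return {
        name: (style_b[name] if name in take_from_b else code)
        for name, code in style_a.items()
    }


# Hypothetical two-attribute styles with two-dimensional codes.
person_a = {"upper_clothes": [0.0, 2.0], "pants": [1.0, 1.0]}
person_b = {"upper_clothes": [1.0, 4.0], "pants": [5.0, 5.0]}

# Keep person A's pants, take person B's upper clothes.
mixed = mix_styles(person_a, person_b, {"upper_clothes"})
# Halfway between the two upper-clothes codes.
halfway = interpolate_codes(person_a["upper_clothes"], person_b["upper_clothes"], 0.5)
```

Because the codes are independent, changing one attribute leaves the others untouched, which is what makes this kind of controllable editing possible.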

3.2.3 Clothing simulation

Clothing simulation is essential for further improving fashion synthesis. The works discussed in the previous sections operate in the 2D domain, where clothing deformation is not modeled when generating a realistic appearance. This important task raises many challenges, such as producing more realistic results in real time while handling more complex garments.

Computer graphics tools were the traditional way to build realistic clothing generation models [15, 58, 97]. Yang et al. [97] proposed an approach that recovers a 3D garment mesh with 2D physical deformations by capturing the global shape and geometry of the clothing and extracting important cloth details from a single-view image. The recovered clothing can be fitted to other human bodies in a variety of poses for virtual fitting. Guan et al. [15] aimed to automatically dress people of different shapes, in different poses, with different clothing types. They proposed the DRAPE (DRessing Any PErson) model to simulate clothing deformation under varying shape and pose (Fig. 14). Then, ClothCap [58] was proposed as a multi-part 3D model that simulates the clothing deformation of people in motion from 4D scans. This model supports virtual try-on by capturing a clothed person in motion, extracting their clothing, and retargeting the clothing to new body shapes.

Fig. 14

Example of clothing simulation results obtained with DRAPE model [15]

Simulating physical deformation plays an important role in improving fashion synthesis because it generates dynamic details, clothing-body interactions, and 3D information. Wang et al. [86] studied this task and proposed a semi-automatic method that learns intrinsic physical properties under different postures to generate garment animation, as shown in Fig. 15. The proposed model encodes the main information of the clothing shape and learns to reconstruct the garment shape with physical properties by considering the intrinsic garment properties and the body motion.

Fig. 15

Examples of physical simulation from the work of Wang et al. [86]

To give a more realistic view of the garment on the human body, Lahner et al. [33] proposed a framework consisting of two modules. The first module recovers shape deformations from 3D data of clothed persons in motion. The second module is a conditional Generative Adversarial Network (cGAN) that ensures realism and temporal consistency and adds high-resolution details to clothing deformation sequences. Then, Santesteban et al. [65] proposed a two-level learning-based clothing animation method for virtual try-on that handles the non-linear deformations of clothing in physical simulation. In addition, Yu et al. [101] proposed SimulCap, a physics-based simulation combined with performance capture. This model tracks people and clothing using a multi-layer surface and thus combines the benefits of capture and physical simulation. The contributions of this work are: (1) a multi-layer representation of garments and body, including the undressed body surface and separate clothing meshes, and (2) a physics-based performance capture procedure that uses body and cloth tracking for physical simulation and clothing-body interactions.

4 Benchmark datasets

Recent progress in virtual try-on systems has been driven by the construction of fashion datasets. Nevertheless, it is difficult to develop a universal dataset for evaluating all virtual try-on methods because the tasks vary widely. Some researchers therefore create their own datasets to evaluate their proposed methods, and this diversity makes it very difficult to compare algorithms. Datasets also bring new challenges and complexity as they are expanded and improved. This section discusses the popular publicly available datasets for virtual try-on tasks and their characteristics. The many benchmark datasets proposed for fashion applications such as virtual try-on systems are summarized in Table 1.

Table 1 Summary of the benchmark datasets for fashion tasks

As summarized in Table 1, each task has specific datasets with corresponding settings. Market-1501 [104] and DeepFashion [43] are the most popular datasets for virtual try-on. FLD [44] is the most widely used dataset for fashion landmark detection. Several datasets, such as LIP [13], were built for the fashion parsing task. Datasets for physical simulation differ from those for other fashion tasks because physical simulation is more closely related to computer graphics than to computer vision. Datasets can also be categorized as real or synthetic, especially for fashion physical simulation, which focuses on clothing-body interactions.

Despite the progress on 2D image-based fashion datasets such as DeepFashion [43], DeepFashion2 [11] and FashionAI [109], datasets based on 3D clothing remain rare or too small for training, as is the case for the digital wardrobe released with MG-Cloth [4]. Recently, Zhu et al. [108] developed a comprehensive dataset named DeepFashion3D, which is richly annotated and covers a much larger variety of garment styles.

5 Performance assessment

In image processing, assessing the perceptual quality of generated results is an important step in validating research work. There is therefore a growing demand for quantitative performance evaluation in image-based garment transfer, driven by the need to judge the quality of virtual fitting systems objectively, to make the various existing approaches comparable, and to measure their improvements.

5.1 Image quality assessment (IQA)

The performance of computer vision tasks is measured with image quality assessment methods, which are divided into objective and subjective methods. Subjective methods rely on human perception to evaluate how realistic the generated images appear. The number of IQA algorithms grows every year, either through new proposals or through extensions of existing algorithms. In this section, we present the most popular IQA algorithms used to evaluate image-based garment transfer tasks.

5.2 IQA for fashion detection

For image-based clothing fitting, the fashion attributes must first be detected in order to predict the clothing style. Most works on clothing localization validate their results using different metrics for different tasks, such as landmark detection, pose estimation and human parsing.

5.2.1 Fashion parsing

In fashion parsing, various metrics are used to evaluate proposed approaches on datasets such as Fashionista [93] and LIP [13], including average Pixel Accuracy (aPA), mean Average Garment Recall (mAGR), Intersection over Union (IoU), mean accuracy, average precision, average recall, average F-1 score over pixels and foreground accuracy. Table 2 reports some quantitative results measured with these metrics. Most parsing methods are evaluated on the Fashionista dataset [93] in terms of accuracy, average precision, average recall and average F-1 score over pixels. In addition, there are objective comparisons for virtual try-on in terms of inception score (IS) [82] or structural similarity (SSIM) [19].
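To make two of the parsing metrics above concrete, the following minimal sketch computes pixel accuracy and per-class IoU for a pair of label maps. It is an illustration only: the function name `parsing_scores`, the integer label encoding, and the choice to skip classes absent from both maps are assumptions, not the exact protocol of any surveyed benchmark.

```python
import numpy as np


def parsing_scores(pred, target, num_classes):
    """Return (pixel accuracy, per-class IoU array, mean IoU) for two label maps.

    pred and target are integer arrays of the same shape, where each entry is
    a class index in [0, num_classes).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # Pixel accuracy: fraction of pixels whose predicted label is correct.
    pixel_acc = float((pred == target).mean())

    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        # A class absent from both maps has no defined IoU; mark it as NaN.
        ious.append(inter / union if union > 0 else float("nan"))
    ious = np.array(ious, dtype=float)

    # Mean IoU ignores the undefined (NaN) classes.
    return pixel_acc, ious, float(np.nanmean(ious))
```

For example, `parsing_scores([[0, 1], [1, 2]], [[0, 1], [2, 2]], 3)` compares two 2x2 maps that agree on three of four pixels, so the pixel accuracy is 0.75 while the mean IoU is lower because the single error affects two classes.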

Table 2 Performance comparisons of fashion parsing methods (in %) [28]

IS is used to evaluate the quality of synthesized images quantitatively. SSIM measures the similarity between input and output images, ranging from zero (dissimilarity) to one (similarity). SSIM is also used in pose transfer, where it compares the luminance, contrast, and structure information of images, and it has been used to evaluate many state-of-the-art methods. Table 3 shows evaluation metrics including SSIM, IS, the masked version of SSIM (mask-SSIM), the masked version of IS (mask-IS) and Detection Score (DS) [70], applied to the Market-1501 [104] and DeepFashion [43] datasets.
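As a rough illustration of how SSIM combines luminance, contrast, and structure, the following minimal sketch computes a single-window (global) SSIM with NumPy. It is a simplification: standard implementations average SSIM over local windows. The function name `global_ssim`, the `data_range` argument, and the constants `k1 = 0.01` and `k2 = 0.03` (commonly used defaults) are assumptions, not values taken from the surveyed works.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two grayscale images of the same shape."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("images must have the same shape")

    # Small constants that stabilize the ratio when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()          # luminance terms
    var_x, var_y = x.var(), y.var()          # contrast terms
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # structure term

    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Comparing an image with itself gives a score of 1. In this formulation, strongly anti-correlated images (for example an image and its inverted copy) can produce negative values, so the zero-to-one range should be read as the typical case.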

Table 3 Results of different state-of-the-art methods for fashion parsing [68]

5.2.2 Human pose estimation

Research in human pose estimation (HPE) has made significant progress in recent years, which has led to a variety of models whose performance must be evaluated with different metrics. The best-known metrics in this field are Percentage of Correct Parts (PCP), Percentage of Correct Keypoints (PCK) and Average Precision (AP), which can be applied to different datasets.
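As an illustration of the PCK metric, the sketch below counts a keypoint as correct when its distance to the ground truth is within a fraction `alpha` of a reference length. The function name `pck`, the default `alpha`, and the use of a single scalar `ref_length` (for example a bounding-box size or head length, depending on the benchmark) are assumptions made for this example.

```python
import numpy as np


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of keypoints whose error is at most alpha * ref_length.

    pred and gt are arrays of shape (K, 2) holding (x, y) pixel coordinates
    of K keypoints for one person.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    # Euclidean distance between each predicted and ground-truth keypoint.
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= alpha * ref_length).mean())
```

With ground truth `[[0, 0], [10, 0], [0, 10], [10, 10]]`, predictions `[[1, 0], [10, 0], [0, 20], [10, 11]]`, `ref_length=10` and `alpha=0.2`, the threshold is 2 pixels and three of the four keypoints fall within it.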

5.2.3 Fashion landmark detection

The most popular evaluation metrics in fashion landmark detection are Normalized Error (NE) and Percentage of Detected Landmarks (PDL). NE is the distance between predicted and ground-truth landmarks, while PDL is the percentage of landmarks detected under a given overlapping criterion. Typically, smaller values of NE or higher values of PDL indicate better results.
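The two landmark metrics can be sketched as follows. This is a hedged example: normalizing coordinates by image width and height, the `threshold` value, and the function name `landmark_metrics` are assumptions chosen for illustration, and individual benchmarks may define the normalization and the overlapping criterion differently.

```python
import numpy as np


def landmark_metrics(pred, gt, image_size, threshold=0.1):
    """Return (NE, PDL in percent) for one image.

    pred and gt are arrays of shape (N, 2) with (x, y) pixel coordinates;
    image_size is (width, height), used to normalize the coordinates.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    scale = np.asarray(image_size, dtype=np.float64)

    # Distance between prediction and ground truth in normalized coordinates.
    dist = np.linalg.norm((pred - gt) / scale, axis=-1)

    ne = float(dist.mean())                         # lower is better
    pdl = float((dist <= threshold).mean() * 100)   # higher is better
    return ne, pdl
```

For a 100x200 image with ground truth `[[10, 20], [50, 100]]` and predictions `[[10, 20], [60, 100]]`, the normalized distances are 0 and 0.1, so NE is 0.05 and, with `threshold=0.05`, PDL is 50%.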

5.3 IQA for fashion synthesis

Image quality evaluation is essential for image generation methods to synthesize the desired outputs. Recent image synthesis research commonly uses simple loss functions to measure the difference between the generated image and the ground truth, e.g., L1-norm loss, adversarial loss, and perceptual loss. Here, we present the evaluation metrics related to each fashion synthesis task, including style transfer, pose transfer and clothing simulation.

5.3.1 Style transfer and pose transfer

Image-based garment transfer aims to transform a source person image to a target pose while retaining the appearance details. Two tasks are essential to this goal: style transfer and pose transfer. Both are very challenging, especially in the presence of human body occlusion, large pose changes and complex textures, and common metrics are used to measure the quality of the generated images. The evaluation of style transfer is generally based on subjective assessment: the results are rated on a scale of several degrees, and the percentage of results at each degree is then calculated to evaluate quality.
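The subjective protocol described above amounts to a simple tally. The sketch below shows one way to compute the percentage of ratings at each degree; the rating labels and the function name `rating_distribution` are hypothetical and only serve the example.

```python
from collections import Counter


def rating_distribution(ratings, levels):
    """Return the percentage of ratings that fall into each level.

    ratings is a list of labels given by human raters; levels lists every
    allowed label so that unused levels are reported as 0%.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    total = len(ratings)
    return {level: 100.0 * counts.get(level, 0) / total for level in levels}
```

For instance, the ratings `["good", "good", "fair", "poor"]` with levels `["good", "fair", "poor"]` give 50% good, 25% fair and 25% poor.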

5.3.2 Physical simulation

There are few quantitative comparisons between physical simulation works. Most of them report qualitative results only within their own work or show visual comparisons with related works. Figure 16 presents an example of such comparisons.

Fig. 16

Evaluation of the work of Santesteban et al. [65] compared with DRAPE [15] and ClothCap [58]

As shown in this section, fashion assessment is based on inception score or human preference score. However, inception score focuses on image quality and ignores aesthetic factors, while a human preference score obtained from a small group can easily be influenced by the users’ personal preferences or by the environment. Thus, one of the challenging research tasks is to build a novel fashion assessment metric that is objective and robust.
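To show why the inception score reflects image quality and diversity rather than aesthetics, the sketch below computes it from class probabilities using its usual definition, the exponential of the average KL divergence between each image's predicted class distribution and the marginal distribution over all images. The probabilities are assumed to come from a pretrained image classifier, which is not included here; the function name `inception_score` and the `eps` smoothing term are assumptions for this example.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception score from an (N, C) array of per-image class probabilities.

    Each row is the classifier's predicted distribution p(y | x) for one
    generated image and is expected to sum to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)

    # Marginal class distribution p(y), averaged over all generated images.
    marginal = probs.mean(axis=0, keepdims=True)

    # KL(p(y | x) || p(y)) for every image; eps avoids log(0).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)

    return float(np.exp(kl.mean()))
```

If every image receives the same uniform prediction, the score is 1, the minimum. If predictions are confident and spread evenly over C classes, the score approaches C. Nothing in the computation measures whether an outfit looks appealing or fits well.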

6 Application and future work

Automating manual processes is a great achievement enabled by technological advances, especially in the field of computer vision. Fashion apparel is one of the largest industries influenced by these advances. Thanks to computer vision-powered tools, a better experience can be created for both retailers and consumers. In the following, we present the applications of fashion technology in various areas and the future work needed to realize the targeted benefits.

6.1 Application

Fashion is an ever-changing industry in which trends succeed one another, and companies must constantly rethink and adapt their products and strategies to maintain their position and retain customers’ preference. AI-based research appears to be a promising avenue for the fashion industry and can be applied to various activities to improve work in this area and maximize financial gains. Because the apparel industry is fundamentally visual, AI systems that understand fashion in images, recognizing them much as humans do, can create a next-level customer experience, for example in online fashion shopping.

This is where future research will bring value to the fashion business by enabling smart shopping. Computer vision is mainly applied to fashion image analysis, object detection and image retrieval [40, 43]. Other researchers have proposed methods for extracting features and accurate attributes from fashion-related images [16, 62]. Recently, many researchers have explored solutions for different fashion tasks using artificial intelligence. Several works have contributed to fashion recommendation [20, 80], object detection and classification [37, 43, 44, 83, 95], and image generation and manipulation [17, 67, 70]. Figure 17 gives an overview of AI applications in the field of fashion.

Fig. 17

Applications of AI techniques in fashion industry

6.2 Challenges

Going completely online brings many challenges for fashion retailers and inspires innovative digital products, such as virtual fitting systems, that make the wholesale process completely digital. The literature presented in this survey shows the potential of AI techniques to provide important solutions for implementing intelligent systems. Despite that, clothing companies do not widely use these advanced techniques because of various limitations in this field. Virtual fitting would be a way to see the virtual effects, but it is still far from solved due to several challenges.

Image-based virtual try-on is among the most promising approaches to virtual fitting: it places a target garment onto a customer’s image, and it has therefore received considerable research effort in recent years. However, several challenges, such as pose, occlusion, cloth texture, and logos and text, make it difficult to achieve a realistic outfit. In this section, we present the most important challenges that future studies on the adoption of AI techniques in the clothing industry can address.

Try-on image generation

Creating realistic images and videos of persons while accounting for pose, shape and appearance is a crucial challenge for computer vision applications in many fields, such as movie production, content creation, visual effects, and virtual reality. In virtual try-on, the body shape and the desired pose of the person strongly influence the final appearance of the target clothing item [21, 22, 61, 89, 100, 106]. Two questions must therefore be answered: (1) how to deform the new clothing item and align it properly with the target person, and (2) how to generate the try-on image while preserving the visual details of the clothing item and maintaining the body parts of the person when clothes are interchanged according to the person’s pose. Recently, several research works [51, 66, 67, 74, 96] have taken up these questions and tried to solve different issues related to image generation, but obtaining photo-realistic images remains an open need, and existing virtual try-on systems still have to be improved.

Network efficiency

Network efficiency is a very important factor for applying algorithms in real-life applications. Diverse data can improve the robustness of networks in complex scenes with irregular poses, occluded body limbs and crowds. The main issue is system performance, which is still far from human performance in real-world settings [33, 65, 86, 101], and the demand for more robust systems grows accordingly. It is therefore crucial to handle data bias and variations in order to improve performance. Moreover, there is a definite need to perform the task in a lightweight and timely fashion, and it is beneficial to consider how to optimize the model to achieve higher performance. Some existing methods use transfer learning and data augmentation [57, 61], but more effective methods are needed to achieve high-quality results with an efficient network.

Virtual try-on datasets

Datasets are very important for validating new models. In particular, deep learning models need large-scale data for training. One of the early realistic, large-scale datasets in the fashion area is DeepFashion [43]. Building new datasets would help virtual try-on progress quickly, and in some cases existing datasets need to be extended using different methods: (1) GANs can serve as a data augmentation technique that helps overcome the weaknesses of existing fashion datasets. (2) Synthetic technology can theoretically generate unlimited data, although there is a domain gap between synthetic and real data. (3) Cross-dataset supplementation, which supplements 3D datasets with 2D datasets, can mitigate the insufficient diversity of training data. (4) Transfer learning proves to be useful in this application. Therefore, how to create or extend a large-scale dataset is a promising direction for both image-based and video-based datasets.

Multi-modal virtual try-on

Relying only on appearance features, such as clothing, extracted from RGB images is not robust enough against environmental variations, as shown by the methods above [7, 9, 12, 16, 21, 22, 27, 50, 61, 62, 84, 89, 98, 100, 106]. Thus, authors should try to combine multiple modalities with complementary information to improve the accuracy of the final task, which makes deep learning on multimodal data one of the new directions in virtual try-on. Another multimodal challenge that new studies need to consider is developing a framework that handles features or modalities that are missing because of occlusions or pose variations. In the last year, some research works have shown interest in this challenge [45, 46].

Unsupervised/supervised fashion research

Most current deep learning try-on systems depend on supervised learning [16, 50, 62, 84, 98], which trains on labeled data from the same environment. Deploying them in new, real-world environments therefore leads to a high annotation cost, because deep learning models need enormous amounts of training data and labeling is a tedious, time-consuming process that often requires expensive human labor. To relieve this labeling burden, it is very useful to work with unsupervised models that extract discriminative features from unlabeled datasets, instead of supervised or weakly supervised learning. Unsupervised learning has started to appear in some works and has become highly sought after in the last year [59, 72, 75, 100].

2D/3D virtual try-on

As mentioned in this survey, current methods such as [7, 9, 12, 16, 17, 21, 22, 27, 48, 50, 51, 61, 62, 66, 67, 70, 74, 84, 89, 96, 98, 100, 104, 106] are still far from an ideal virtual try-on system, for several reasons related to the input data. First, clothing deformation and occlusion make the garment rendering process very hard. In addition, 3D human body modeling for arbitrary poses is still challenging [2, 4, 5, 87]. Thus, new approaches should be proposed to capture the details of body shape and clothing.

Fashion generation conditioned on text

Despite the advances in intelligent fashion systems, the automatic synthesis of photo-realistic images from text is still needed to obtain good results in the design process. This need arises from the diverse attributes of fashion images, such as color, pattern and style. Research must therefore focus on how to handle complex conditions and on which data sources should serve as inspiration. Some studies on intelligent fashion systems have addressed this challenge, such as Semantic-Spatial Aware GAN [23] and inspirational adversarial image generation [63].

6.3 Open issues and future directions

Technology has always played an important role in the fashion industry, and it has started a deeper and faster transformation that is changing the way customers shop and interact with products and brands. At the same time, companies are adopting these technologies to offer the best shopping experience. Virtual try-on applications are a key technology in the fashion industry: they provide important benefits to apparel companies, allow customers to try on garments before purchasing, improve accuracy, and suggest well-fitted garments for each body type. Throughout the pandemic, virtual try-on has offered a great service to e-consumers and to brands unable to demo their products offline.

Virtual try-on solutions represent fit to the body as well as garment pattern design, style and color in order to produce the best clothing fitting results, because the main goal of retailers is to show that virtual try-on matches the real garments. Thus, the priority for researchers is to identify the key challenges and the critical success factors that determine how effectively digital technologies are implemented in the online garment industry. Doing so helps bridge the gap between physical and digital shopping and reach people wherever they are, a need that arose during the pandemic and will persist with the rise of e-commerce.

Virtual try-on applications have the potential to provide a significant benefit to clothing e-retailers, but their adoption in the clothing sector is still limited. Even with recent technological advances, existing try-on applications are not yet mature enough to deliver the targeted results. Most of them are not realistic enough for customers to feel comfortable when trying on a garment, because the structure of the clothes is not coherent and looks artificial. Therefore, many challenges remain unresolved, and there is still a gap between research and practical applications, as mentioned in the previous section. These crucial challenges in adopting fashion technologies arise because real-world fashion is much more complex than experimental settings.

With this objective, we present in this paper a review of the literature on the virtual try-on task, which can provide researchers with explicit research directions, facilitate their access to related studies and improve the visibility of adopted methods. This literature review thus helps to understand, from existing works, how to implement an efficient virtual try-on system and how to understand fashion images. However, people want to see different views of themselves in the desired clothing product before making a purchasing decision. Considering this objective, a virtual try-on system must be designed and developed such that, given a person image, a desired pose, and a target clothing item, it can generate the try-on look of the person with the target appearance and the desired pose. We illustrate this process in Fig. 18.

Fig. 18

Illustration of the idea of Virtual Try-On System

To this end, most of the systems presented in this paper proceed as follows. They first perform fashion detection to localize where a fashion item appears in the image or where the different body parts are located. Then, they swap clothes between different images of persons and deal with the large variations in body poses and shapes via deep learning models. These studies show that significant progress has been made in this direction using learning-based image generation tools, such as GANs, which enable a wide range of applications, such as human appearance interchange, virtual try-on, motion transfer, and novel appearance synthesis. However, because of the under-constrained nature of these tasks, most existing methods are limited in the visual quality of their generated results and present observable artefacts, such as blurring of small details, loss of facial identity, unrealistic distortions of body parts and garments, and severe changes in texture. Most methods are not able to recover texture details properly. Figure 19 shows the result of the recent NHRR method proposed by Sarkar et al. [67].

Fig. 19

Limitations of the generated results of the virtual try-on method of Sarkar et al. [67]

Despite the important results achieved by the approaches discussed in this survey, and the power of measuring technologies developed with deep learning methods, several limitations persist, such as imperfect results and incorrect fit on the human body. Therefore, future studies should focus on providing realistic presentations of different target appearances of consumers, allowing them to virtually choose and try on preferred clothes and to adjust the size, style, and color of desired items using deep learning-based approaches.

7 Conclusion

The advances made with AI technologies in the fashion industry have not yet reached the goal of modeling real-world problems, which remains challenging because important hurdles exist at various levels. Thus, applying AI techniques to this task requires careful consideration of the various practical features of the clothing industry to ensure optimal solutions. The studies on intelligent fashion analysis surveyed in this paper are just the beginning of this wide research domain: enormous research efforts have already been spent on these tasks, and they will continue to grow and expand due to the enormous profit potential of the ever-growing fashion industry. Future directions must bridge the gap between research and real industry demand by adding new features and services intended to give customers the same support and comfort that they would have during an in-person shopping experience.