Fashion Intelligence in the Metaverse: Promise and Future Prospects



Introduction
Building a virtual world that is parallel to the real one has been a dream of human beings since ancient times. A science fiction novel written by Neal Stephenson in 1992 [1] originally described a sun-filled virtual world that offered an alternative to the abysmal real world, and the author referred to this as the Metaverse. Since then, the concept of the Metaverse has continued to appear in film and television works. Over the last decade in particular, the Metaverse has expanded considerably. In addition, many technology companies have also drawn attention to the Metaverse, and some of them have even changed their names to Meta, taken from the first four letters of 'Metaverse'.
Essentially, the Metaverse can be regarded as a virtual world that is both parallel to the real one and interacts with it. It resides in a virtual space that mirrors the natural world, yet operates independently of the real world. A digital virtual human is the element through which an individual in the real world moves freely in the Metaverse. There is no doubt that clothing plays a vital role in daily life in the real world, as it can implicitly reflect a person's internal characteristics, such as their personality and aesthetics, and social characteristics such as social status and occupation. The dressing of a digital virtual human, which is the mapping of the human user in the real world, will therefore also play an essential role in the Metaverse. A suitable outfit in the Metaverse can not only make a digital virtual human more vivid and concrete, but can also represent the characteristics of the person controlling it. In response to this trend, some fashion companies have already created industrial layouts for the Metaverse, such as the production of virtual sneakers [2] and a range of clothing designed for humans and avatars [3].
Fashion artificial intelligence (AI) can affect a wide range of application scenarios in the Metaverse. In fact, the Metaverse is not completely separate from the real world; although it is a world formed in a computer, and is a mapping of the natural world to the virtual world, this virtual world can affect the real one in a straightforward way. In particular, fashion AI can help us carry out certain activities in both the real and the virtual worlds; for example, fashion AI can automatically extract trends from the large amounts of fashion data in the Metaverse, thereby assisting designers to create more data-inspired products. In addition, when consumers are shopping for clothing, they can choose items that suit them more quickly with the help of fashion recommendations. Fashion intelligence can also help us accomplish activities that are impossible or difficult to achieve in the real world. For example, compared with real-world fashion data, which are hard to collect, fashion intelligence applications in the Metaverse can utilize easily accessible online data to make more accurate fashion trend predictions and to help retail companies characterize market trends. Moreover, fashion editing can help fashion designers to modularize clothing, while fashion generation can simplify the processes used in the clothing industry, from design to ready-to-wear products, as 'what you see is what you get'. Thus, the introduction of the Metaverse means that there is a broader range of application scenarios for fashion intelligence than in the real world.
Although there is a large body of literature in the field of fashion intelligence, to the best of our knowledge there has been no systematic investigation of fashion intelligence from the perspective of the Metaverse. To fill this gap, this research aims to provide a comprehensive survey of fashion intelligence in the Metaverse. More specifically, since fashion and people are inseparable, we conduct this survey from two perspectives, considering digital virtual humans and fashion intelligence technologies. In regard to digital humans, we use body parts as a basis for exploring the standard technologies for generating these avatars, while in regard to fashion intelligence, we summarize the latest developments in the technologies required for fashion intelligence based on the scenes in which designers and customers are located. Finally, we highlight some extant challenges in order to shed some light on future developments in fashion intelligence in the Metaverse.
The remainder of this paper is organized as follows. Section II introduces the basic concepts of the Metaverse and the classical tasks associated with fashion AI. In Section III, we present our classification framework and a qualitative analysis of relevant studies. The generation of a digital human is explained in Section IV. Section V gives an overview of specific methods and techniques used in fashion intelligence, from the perspectives of both designers and customers. We illustrate the challenges faced in the domain of fashion intelligence in the Metaverse in Section VI, and conclude the paper in Section VII.

Terminology and Background Concepts

Metaverse
The word 'Metaverse' originates from Neal Stephenson's novel Snow Crash, published in 1992, which described a virtual world parallel to the real world. Each person in the real world had a digital avatar, which was used in the virtual realm of the Metaverse to work, make friends, shop, travel, etc. Currently, there are several definitions of the Metaverse in academia. Mystakidis [4] views the Metaverse as a persistent multi-user environment based on a fusion of physical reality and digital virtuality. Ning et al. [5] state that the Metaverse is a multi-technical, social, and super-temporal virtual world that is parallel to the real world. At present, the Metaverse remains at the conceptual stage, and many existing technologies will need to be combined to create this new virtual world and to integrate it with reality. Of these extant technologies, extended reality (XR), digital twins, and the blockchain form the core of the Metaverse.

Fashion Intelligence
As an essential aspect of daily life, fashion can be regarded as a mirror that implicitly reflects people's attitudes. Fashion analysis based on the use of AI has successfully increased the economic benefits of the fashion industry [6]. Fashion intelligence focuses on the application of AI to the fashion industry; in particular, computer vision technologies such as object detection, object analysis, image retrieval, and image generation are leveraged to improve the efficiency of practitioners in the fashion industry and to enhance the consumer's shopping experience. Hence, determining how best to transform fashion data into relevant computer vision tasks and to design task-specific models appears to be critical for these real-world applications. In practice, fashion tasks can be roughly divided into three categories: low-level pixel-based fashion computing, which can be used for fashion parsing and landmark detection; mid-level fashion analysis, which aims at identifying fashion items from images and can be used for fashion detection and fashion attribute prediction; and high-level understanding of fashion, which involves an overall analysis of the attributes of fashion items at the image level and explores the relationships between fashion items for tasks such as fashion retrieval, compatibility learning, and garment recommendation.

Classification Scheme and Analysis
In this section, we classify current research on fashion intelligence in the Metaverse and conduct an overall analysis of these studies.

Classification Scheme
In the field of fashion, items are usually associated with the human body. Only fashion items worn by models can fully display their beauty. Likewise, fashion items can express the mood and personality of the person wearing them. We surveyed the use of fashion intelligence in the Metaverse, from the generation of digital virtual humans to fashion intelligence technology. The generation of digital virtual humans in the Metaverse focuses on efficiently generating realistic or user-friendly 3D human body models. An avatar can express the user's emotions by wearing fashionable clothing in the Metaverse. Fashion intelligence technology concentrates on analyzing and understanding fashion items, and on facilitating the production and sale of fashion items. For clarity, Figure 1 provides a classification of fashion intelligence in the Metaverse.
The goal of the Metaverse is to provide users with an immersive virtual world, and the realistic 3D modeling of humans is essential to achieve this goal. Much research has been conducted on the generation of digital virtual humans, most of which has been devoted to automatically generating realistic 3D virtual human models, with the objective of reducing the dependence of digital virtual human modeling on expert manual modeling. In prior studies of the human body, the task of generating a digital virtual human has been divided into two subtasks: 3D face generation, and 3D human body generation. Extant techniques related to these subtasks are elaborated in the following sections.
The use of AI has brought considerable convenience to the fashion industry [7,8]. Increasing numbers of researchers have drawn attention to the mining and analysis of fashion data to achieve an in-depth understanding of fashion elements. In practice, however, designers and customers, the two main groups of fashion practitioners, have different needs in terms of fashion intelligence tasks. Designers prefer technology that can provide them with tools to assist their design processes, while customers expect fashion intelligence to provide them with a better shopping experience. Recently, many techniques have been explored in order to meet or stimulate the needs of both designers and consumers. As a result, the fashion industry is experiencing a strong boom driven by the use of these techniques.

Statistical Analysis
In this subsection, we conduct a basic statistical analysis of the number of publications, the year of publication, and existing experimental datasets in the domain of fashion intelligence. We used a public database called DBLP as our search engine, as it contains most of the studies in this field. The main keywords that were used as input to the search engine were "Metaverse", "fashion", and "digital human". We restricted our attention to works published in high-quality journals and conference proceedings, such as CVPR, ECCV, NIPS, TPAMI, etc. Figure 2 shows the distribution of papers published via these outlets, and it can be observed that the top two publication venues were ACM MM and CVPR. Figure 3 shows the number of yearly publications from 2008 to the present; an explosive increase in the number of publications on the Metaverse can be seen from 2021. In addition, due to the successful application of deep learning [9-12], research on fashion and digital humans has proliferated since 2016.

Digital Virtual Humans
As one of the fundamental components of the Metaverse, a digital virtual human is a representation of a digital identity that allows a player to interact with others or with computer agents. A digital avatar represents the user's identity, and is the virtual entity with which the user is in contact for the longest time in the virtual world. It is therefore natural for users to want to choose the appearance of their digital avatar according to their preferences. In addition, digital virtual humans can wear fashion items in the Metaverse, such as clothes, earrings, bracelets, etc. This section introduces current research on digital virtual human generation technology and the development of this field from two perspectives: 3D face generation, and 3D human body generation.

3D Face Generation
A realistic face can reduce the user's sense of incongruity and facilitate a more immersive Metaverse experience. The purpose of 3D face generation is to generate a realistic 3D model of a human face that can be driven by audio, face transformations, etc. The task of 3D face reconstruction involves recreating the detailed features extracted from 2D images in the form of a 3D model, especially in terms of shape, texture, etc. More specifically, 3D face reconstruction can be separated into two primary types based on the number of 2D photos used for feature extraction: single-view reconstruction and multi-view reconstruction. Single-view reconstruction is usually more difficult than the multi-view process, due to the limitations of a 2D image structure. It is frequently the case that a single image cannot provide all the feature information required for face reconstruction, resulting in the need to predict attributes to achieve the effect of a 3D model. Currently, 3D morphable models (3DMMs) [13] serve as the foundation for 3D face reconstruction. A 3DMM is a parametric face model that can generate virtually any face based on a fixed number of points. Faces can be matched one-to-one in 3D space by linearly combining a set of orthonormal basis vectors with learned weights. The aim of both single-view and multi-view face reconstruction is to obtain fitting parameters for a 3DMM so as to obtain realistic faces in 3D space.
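The linear construction at the heart of a 3DMM can be sketched as follows. This is a minimal illustration of the idea only: the mean shape, basis vectors, and coefficients below are toy values, not a trained 3DMM basis.

```python
# Minimal sketch of a 3DMM-style linear face model (toy data, not a real
# basis): a face shape is the mean shape plus a weighted sum of basis
# vectors, with one weight (identity coefficient) per basis direction.

def morph_face(mean_shape, basis, coeffs):
    """Return mean_shape + sum_k coeffs[k] * basis[k] (flat vertex lists)."""
    face = list(mean_shape)
    for w, b in zip(coeffs, basis):
        for i, v in enumerate(b):
            face[i] += w * v
    return face

# Toy example: 2 vertices in 3D (6 coordinates), 2 basis directions.
mean = [0.0] * 6
basis = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # shifts vertex 0 along x
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # shifts vertex 1 along y
]
face = morph_face(mean, basis, coeffs=[0.5, -0.2])
print(face)  # [0.5, 0.0, 0.0, 0.0, -0.2, 0.0]
```

Fitting a 3DMM to an image then amounts to searching for the coefficient vector (plus pose and illumination parameters) whose synthesized face best matches the observed features.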

Single-view 3D face reconstruction
Over the past decade, there has been a great deal of research on 3DMM single-image fitting. The texture, color, and other features of the face image need to be preserved in the fitted face model as far as possible, and the model must be accurately aligned with the facial contours of the target image. Most traditional methods regard face reconstruction as an optimization problem, where the 3DMM is used to synthesize images based on the unique features of face images, such as facial landmarks, edges, pixel colors, etc. In particular, Choi et al. [14] proposed a framework that automatically estimated all 3D scene parameters from single- or multi-view images. Kemelmacher-Shlizerman et al. [15] presented a single-image face reconstruction model based on a computation of facial similarity. In addition, global human face similarity and face pose estimation were exploited to overcome the significant differences in shape between the input and reference subjects. Furthermore, Romdhani et al. [16] suggested a multi-feature fitting algorithm to improve the convergence properties of conventional face reconstruction models. However, due to the diversity of face poses and the complexity of image backgrounds, conventional optimization methods are sensitive to initial conditions and parameter changes, making the process of single-image face reconstruction relatively fragile for practical applications. Recent developments in deep learning have allowed many researchers to present new ideas for parameter optimization problems. To address the problem whereby conventional methods cannot capture nonlinear expressions to create complex expressions, Ranjan et al. [17] introduced a convolutional mesh autoencoder (CoMA) that could learn nonlinear representations of human faces. Richardson et al. [18] presented an end-to-end convolutional neural network (CNN) framework for generating faces in a coarse-to-fine manner. To solve the depth estimation problem in facial reconstruction, Lee et al. [19] proposed a displacement map generation network (DPMMNet) that generated a displacement map to estimate a detailed geometry.
In addition, several methods have been utilized to recover the 3D information of the face from a single image. Image processing methods such as shape from shading (SFS), UV maps, thin plate splines (TPS), and epipolar plane images (EPI) have been applied to single-image face reconstruction. For example, Jin et al. [20] first used a 3DMM to reconstruct a smooth face shape and employed landmark-conducted Laplace deformation to fine-tune this shape. An SFS optimization process was then designed to recover the multiscale geometric details. A position map regression network (PRN) [21] was developed to achieve 3D facial structure reconstruction and dense face alignment. A UV map recording the spatial position of each pixel was fed into a lightweight encoder-decoder for reconstruction of the 3D model. Bhagavatula et al. [22] proposed a new method of 3D face reconstruction that combined feature extraction with the TPS warping function. EPI is a method of estimating scene depth based on the differences between the image pixels at points in the camera plane and the image plane. Feng et al. [23] presented a model-free approach to reconstructing the 3D face model. Their method was trained with a densely connected CNN architecture called FaceLFnet, based on the horizontal and vertical EPIs of light field images. The authors reported that this method was robust to changes in pose, facial expression, and lighting in face reconstruction tasks.

Multi-view 3D face reconstruction
Unlike single-view face reconstruction methods, multi-view methods do not require strong inductive biases to accomplish model deformation. These approaches can extract facial features from multiple viewpoints in different images to create more detailed 3D models. The efficient fusion of features from multiple views is the key to achieving accurate depth estimation and facial texture recovery. Multi-view methods have attracted considerable attention due to their powerful, fine-grained modeling capability. However, most 3D face reconstruction methods using multi-view face images still rely on generic 3D face models. For example, Wang et al. [24] proposed a 3DMM-based multi-view face reconstruction method that employed multi-view geometric constraints to eliminate ambiguity from images. Subsequently, an adaptive photometric stereo-based reconstruction method was presented in [25]. Wu et al. [26] designed an end-to-end trainable CNN to set 3DMM parameters. Several image processing methods have also been employed for the task of multi-view face reconstruction. In particular, Li et al. [27] used an implicit representation to encode the extensive geometric features of faces, which could improve the generalization performance and quality of 3D face reconstruction. It is very likely that view-based 3D face reconstruction methods will have a multitude of applications related to the Metaverse; for example, these methods can greatly lower the barriers to large-scale face reconstruction of users and can reduce the computational overhead in the Metaverse.

Human Body Generation
A digital virtual human is an indispensable digital identity for each user of the Metaverse. All activities in the virtual world, such as communication, picking up items, etc., must be handled via these digital identities. A beautiful and unique digital avatar, manually designed with a rich level of detail, is welcomed by many users of the Metaverse. However, creating a digital avatar manually for each user is not practical, as this would require a great deal of time and effort. The purpose of 3D human body generation is to automatically generate realistic 3D human models, which can reduce the cost of building the Metaverse. This section gives an overview of current research on 3D human body generation from two perspectives: generic 3D human body models, and human body reconstruction from images. A human body model focuses on representing human bodies in 3D space, whereas human body reconstruction is dedicated to generating similar 3D models from 2D images.

Human body models
Modeling the human body has always been challenging for practitioners in both academia and industry. In the past, creating detailed human models required professional artists to generate models manually, or the use of 3D scanning to capture the geometry and texture features of the body. However, these methods are time-consuming, demand a high level of expertise from the artists, and are sensitive to site conditions. Fortunately, the human body has standard features in terms of shape and pose, which allow researchers to build a parametric 3D body model based on an analysis of high-quality data representing human features. This parametric model can create a detailed 3D human body model based on only a few body features, as well as significantly improving the efficiency of body modeling. There are two commonly used parametric human body models: shape completion and animation for people (SCAPE) [28], and the skinned multi-person linear model (SMPL) [29]. Both approaches represent the human body through a set of triangular surfaces {f1, f2, ..., fn}, where the vertices of each triangle fi are {vi,1, vi,2, vi,3}. SCAPE [28] is a unified parametric 3D human body model that combines body shape and pose information to achieve a human representation. Drawing on Sumner's idea of deformation transfer [30], SCAPE employs a 3 × 3 matrix to represent the deformation of each triangle as a discrete differential gradient field, which can be used to transfer deformation from one model to another. The introduction of SCAPE is regarded as a milestone in the development of 3D human body modeling, and many studies have been devoted to improving the performance of this method. For example, Hasler et al. [31] designed a model called invariant-SCAPE to solve the problem whereby the triangle deformation in the original SCAPE uses different encodings for the same shape. Hirshberg et al. [32] proposed an optimized BlendSCAPE model that made the joints of the digital body smoother. In addition, Jain et al. [33] proposed a simplified SCAPE model (s-SCAPE) to improve the speed of body modeling.
The deformation in SCAPE [28] depends on the rotational deformation of triangle patches, which means that these human models cannot be used directly in popular animation software. SMPL [29] was proposed to solve this problem. In a similar way to SCAPE, SMPL [29] employs pose and shape to model the human body. It uses a 10-dimensional vector to describe the shape of the body, whose parameters can be obtained by principal component analysis (PCA) of the shape deformations. To calculate the pose representation, SMPL uses a kinematic tree to represent the 24 joint points of the body. Many studies have been devoted to improving the performance of SMPL. For example, SMPLify [34] detects 2D joints with a CNN-based pose estimator and optimizes the SMPL parameters (including body shape and pose parameters) by minimizing the error between the projected joints of the synthesized 3D pose and the detected 2D joint points. However, this method does not constrain the shape of the body, and the algorithm easily falls into locally optimal solutions, causing reconstruction failure. Based on the SMPLify model [34], Lassner et al. [35] added more human joint points (91 points) and obtained accurate pose reconstruction results. Corona et al. [36] proposed a differentiable model for the reconstruction of the body and clothing.
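The kinematic-tree idea underlying SMPL's pose representation can be illustrated with a deliberately small sketch: joints form a tree given by a parent-index array, and each joint's world position follows from its parent's. The parents and offsets below are made-up toy values, and the real SMPL additionally applies per-joint rotations and linear blend skinning.

```python
# Toy sketch of a kinematic tree (the structure SMPL uses for its 24 body
# joints): parents[i] is the parent joint of i (-1 for the root), and
# offsets[i] is joint i's (x, y, z) offset from its parent.

def joint_world_positions(parents, offsets):
    """Accumulate local offsets down the tree to get world positions."""
    world = []
    for p, off in zip(parents, offsets):
        if p < 0:
            world.append(list(off))                       # root joint
        else:
            world.append([world[p][j] + off[j] for j in range(3)])
    return world

# A root -> spine -> head chain (hypothetical offsets).
parents = [-1, 0, 1]
offsets = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.25, 0.0)]
print(joint_world_positions(parents, offsets))
# [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.75, 0.0]]
```

Because every joint depends only on ancestors earlier in the array, a single forward pass over the joint list suffices, which is why SMPL-style models pose the whole skeleton cheaply.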
A parametric human body model can be regarded as an essential 3D human body reconstruction technique, in which the aim is to use corresponding parameters as input to construct a precise 3D model of the shape and posture of the human body. SCAPE and SMPL, the two most well-known parametric body models, were developed by leveraging human body datasets to learn the characteristics of human body shape. Fitting dense 3D point cloud data or depth data of the body to the parameters of a parametric model through point cloud registration, template deformation, etc., is a standard method of reconstructing the human body in fine detail.

Human body reconstruction from images
Human body generation based on 3D scanners requires specialized capturing systems with strict environmental constraints (e.g., large numbers of sensors and controlled lighting) that are very expensive and cumbersome to deploy. Due to its convenience, image-based 3D human body reconstruction has attracted the attention of many researchers over the last decade. Based on the number of viewpoints used for feature extraction, image-based reconstruction can be divided into single-view and multi-view methods. Single-view human reconstruction is less restricted by the environment than multi-view approaches, but the corresponding accuracy of the reconstructed 3D model is often lower. In a similar way to 3D face reconstruction, 3D human body reconstruction also requires strong prior models as support. Hence, general body modeling methods such as SCAPE [28] and SMPL [29] are widely used for 3D reconstruction.
Statistical body shape models, as a powerful human prior, allow for convenient disentanglement of pose and shape. Fitting the pose and shape of statistical body shape models to a body in a 2D image is an essential aspect of model-based single-view human body reconstruction. In traditional methods, the prediction of human body model parameters is transformed into a model parameter optimization problem. Initially, annotated 2D landmarks and silhouettes were employed [37] as image features to optimize the parameters of the SCAPE model, with promising results. Lassner et al. [35] used auxiliary landmarks on the body surface and added an estimated silhouette to make the model more accurate. Bogo et al. [34] annotated keypoints in 2D images and aligned them with keypoints in 3D models to obtain better results. However, optimization problems rely heavily on the initialization of the solution, and are prone to local minima. Hence, many researchers have performed pose and shape parameter regression through network training by mapping the extracted image features to a low-dimensional parameter space. The basic framework used for 3D human body reconstruction via network regression is shown in Figure 4. Effective feature extraction is critical to ensure an accurate 3D result. Feature extraction strategies such as landmark detection, keypoint detection, body silhouette detection, and semantic segmentation have often been used to improve the fit of the model.
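The optimization formulation described above can be illustrated with a deliberately tiny example: a single scale parameter is fitted by gradient descent so that "projected" model keypoints match observed 2D keypoints. The one-parameter model and its data are hypothetical stand-ins for the full pose-and-shape parameter space of SCAPE or SMPL.

```python
# Toy sketch of model-based fitting: adjust a model parameter by gradient
# descent so that projected model keypoints match detected 2D keypoints.
# Here the "model" is a single scale factor s applied to template keypoints.

def fit_scale(observed_2d, template_2d, steps=200, lr=0.1):
    """Find the scalar s minimizing sum over keypoints of (s*t - o)^2."""
    s = 1.0
    for _ in range(steps):
        # Gradient of the squared reprojection error w.r.t. s.
        grad = sum(2 * (s * t - o) * t for t, o in zip(template_2d, observed_2d))
        s -= lr * grad / len(template_2d)
    return s

template = [1.0, 2.0, 3.0]
observed = [2.0, 4.0, 6.0]   # the template scaled by 2
print(round(fit_scale(observed, template), 3))  # 2.0
```

Real methods minimize the same kind of reprojection objective over dozens of pose, shape, and camera parameters, which is exactly why they are sensitive to initialization and local minima, as noted above.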
The parametric general body model can keep the prediction space small in the reconstruction of the human body. However, it cannot model the human body in clothing. Therefore, nonparametric representations such as hulls [38], point clouds [39], triangular meshes [40], and voxel grids [41] have been used for 3D human body reconstruction, as they can predict shape representations directly from images. Natsume et al. [42] implicitly represented the shape of the human body through the contours and joints of the body pose, and then fed the frontal image and its mask into a generative adversarial network (GAN) to infer the texture of the body and thereby model the clothed human body. Moreover, Krajnik et al. [43] proposed a novel method that reconstructs each part of the human body independently; it appeared to have smaller errors than other methods, especially in the concave areas of the human body.
Unlike single-view body reconstruction, multi-view reconstruction can describe body features from multiple views, which can reduce the error in the prediction of unobservable body parts. Traditional methods use image consistency and depth estimation to establish the correspondence of joints and other feature points between images from different views; however, these methods are easily affected by occlusion. Multi-view 3D human body reconstruction therefore often estimates depth maps from the 2D images and fuses them to create a unified mesh for 3D body generation. With the help of the multi-view calibration capability of deep learning, many of these approaches have overcome the limitations of traditional methods. For instance, Liang et al. [44] used an image encoder to extract image features and passed these features through multiple regression blocks to predict human body parameters in a stage-by-stage and view-by-view process. Pix2Vox [45] used an encoder-decoder to generate corresponding 3D bodies for humans in each view. Saito et al. [46] designed an end-to-end network to digitize a clothed human body, using a pixel-aligned implicit function (PIFu) to locally align the pixels in the 2D image with the corresponding context in the 3D body. In addition, Yu et al. [47] proposed a coarse-to-fine learning model that utilized graph convolutional networks to deform templates to the ground-truth mesh.
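The simplest baseline for combining per-view estimates is a confidence-weighted average. The sketch below shows only that naive baseline; learned approaches such as the stage-by-stage regression in [44] replace it with trained fusion modules, and the per-view confidences here are hypothetical.

```python
# Naive multi-view fusion baseline: each view produces its own estimate of
# the body parameter vector, and a confidence-weighted average fuses them.

def fuse_views(estimates, confidences):
    """estimates: one parameter vector per view; confidences: one weight per view."""
    total = sum(confidences)
    dim = len(estimates[0])
    return [sum(w * est[i] for est, w in zip(estimates, confidences)) / total
            for i in range(dim)]

views = [[1.0, 0.0], [3.0, 2.0]]      # two views' (toy) parameter estimates
print(fuse_views(views, [1.0, 1.0]))  # [2.0, 1.0]
```

Weighting by confidence lets an occluded view (low weight) contribute less, which is the intuition that learned fusion networks refine.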

Fashion Items in the Metaverse
As the two most important groups in the real-world fashion industry, designers and consumers also play a vital role in the fashion community of the Metaverse. By creating new fashion items, designers increase the diversity of the fashion community. As users of fashion items, consumers inject vitality into the fashion community through their evaluations of and feedback on fashion items. In this section, we give an overview of certain exciting fashion scenarios in the Metaverse from the perspectives of both designers and consumers. The common methods used in these fashion scenes are summarized and classified, with particular reference to the most representative and novel methods in this field.

Designers
The main objective of fashion designers in the Metaverse is the same as in the real world: to create consumer-preferred fashion products. A system that can facilitate fashion tasks in the Metaverse is crucial in terms of helping fashion designers to design satisfactory products more quickly and efficiently. In the following, we describe how carrying out fashion tasks in the Metaverse can help designers.

Fashion Parsing
When a designer wants to create a fashion item, browsing existing items of the same type can help in finding inspiration. However, searching for a particular type of fashion item in a multimedia database is difficult for designers. In the Metaverse, fashion parsing can help designers achieve this efficiently.
Fashion parsing involves segmenting fashion items from images containing multiple such items by labeling each pixel in an image. Fashion parsing is a prerequisite for many fashion tasks, as it can identify the individual fashion items in an image for subsequent processing. Due to the diversity of clothing types, fashion parsing is more challenging than general semantic parsing. In addition, the non-rigid characteristics and the deformed structure of clothing on the body in a given image make it necessary to add semantic information to both the clothing and the human body in order to perform high-level judgments in the task of fashion parsing. In general, fashion parsing methods can be divided into two categories: non-deep learning methods, based on traditional techniques, and deep learning methods, which rely on a fully convolutional network (FCN)-based image segmentation pipeline. In non-deep learning methods, specific prior rules for label inference are added to traditional semantic segmentation models for fashion parsing. In contrast, deep learning methods rely on the robust feature extraction ability of a neural network to fuse information such as the texture, edges, and shape of the clothing, which is used to enhance the performance of the clothing parsing model.
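The per-pixel labeling that defines fashion parsing can be sketched in miniature: per-class score maps (as an FCN would output) are reduced to a label map by taking the best-scoring class at each pixel, and quality is commonly reported as per-class intersection-over-union (IoU). The class names and score values below are illustrative only.

```python
# Toy illustration of fashion-parsing output: per-pixel class scores are
# reduced to a label map by argmax; IoU measures overlap with ground truth.

CLASSES = ["background", "top", "skirt"]  # illustrative label set

def argmax_labels(score_maps):
    """score_maps[c][y][x] -> label map picking the best class per pixel."""
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [[max(range(len(score_maps)), key=lambda c: score_maps[c][y][x])
             for x in range(w)] for y in range(h)]

def iou(pred, gt, cls):
    """Intersection-over-union of class `cls` between two label maps."""
    inter = sum(p == cls and g == cls for pr, gr in zip(pred, gt)
                for p, g in zip(pr, gr))
    union = sum(p == cls or g == cls for pr, gr in zip(pred, gt)
                for p, g in zip(pr, gr))
    return inter / union if union else 1.0

scores = [  # a 2x2 image, one score map per class
    [[0.9, 0.1], [0.2, 0.1]],   # background
    [[0.05, 0.8], [0.1, 0.2]],  # top
    [[0.05, 0.1], [0.7, 0.7]],  # skirt
]
pred = argmax_labels(scores)
print(pred)  # [[0, 1], [2, 2]]
```

Real parsers differ in how the score maps are produced (the FCN backbone and auxiliary modules discussed below), but the labeling and evaluation step takes this form.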
Clothing parsing tasks have long been explored by researchers, initially focusing on clothing recognition in only a few scenarios [48] or sketch recognition for clothing design [49]. However, these works [48,49] are limited to only a few applications, and the results are usually unsatisfactory in practice. Yamaguchi et al. [50] put forward an innovative idea for fashion parsing, in which they used superpixels to simplify the task of fashion parsing and combined human pose estimation to parse clothing. However, their approach requires pixel-level labels in order to carry out model training, which imposes enormous costs in terms of time and manual labor. To address this problem, Liu et al. [51] employed multiple well-trained classifiers to parse fashion items from a given image. Drawing on the idea underlying the scheme in [50], Dang et al. [52] proposed Parselet for human pose estimation and used conditional random fields (CRFs) to perform clothing analysis via unary and pairwise potentials. In order to solve the problem whereby the performance of a parser is typically limited by the training data, Liu et al. [53] proposed a fashion parsing algorithm that could be trained on fashion videos. In a later study, Zhao et al. [54] proposed a clothing co-segmentation (CCS) algorithm to automatically segment and extract clothing regions from given images with natural backgrounds. Although the styles of clothing are ever-changing, most clothing of the same type has similar characteristics, making it possible to parse garments using data-driven techniques. In particular, Yamaguchi et al. [55] proposed a data-driven fashion parsing method that essentially transfers pixel predictions from samples retrieved in response to a query.
Unlike traditional methods, which require prior knowledge in the form of manual segmentation for preprocessing, deep learning methods rely on receptive fields of various sizes in the network to extract the contextual information on the human body and clothing items in an image. Following the developments in deep learning technology, the successful use of FCNs for general semantic segmentation tasks has attracted the attention of researchers working on fashion parsing. Some researchers have performed clothing parsing by adding subsequent processing steps, such as CRFs and additional discriminators, to the FCN architecture [56]. One group of researchers has focused on building an end-to-end fashion parsing framework by incorporating CRFs into parsing neural networks [57]. Fashion parsing methods based on deep learning generally adopt a dual-path network architecture, as shown in Figure 5. In this structure, one path employs an FCN to extract the fine-grained content features from images, while the other employs auxiliary modules to enhance the annotation segmentation pipeline. These auxiliary modules improve the accuracy of clothing parsing by extracting the unique semantic information of fashion items. These modules include texture feature maps [58], outfit encoders [59], edge-preserving modules [57], pyramidal aggregation-excitation context modules [57], and other network flows.
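Since parsing assigns one label per pixel, its quality is usually reported as per-class intersection-over-union (IoU). The sketch below, using illustrative toy label maps, shows how this metric is computed; the function name and class ids are our own, not from any cited system.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class intersection-over-union for pixel-labeled parsing maps.

    pred, gt: integer arrays of the same shape, one class label per pixel.
    Returns a dict mapping class id -> IoU (classes absent from both are skipped).
    """
    ious = {}
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

# toy 4x4 parsing maps: 0 = background, 1 = top, 2 = skirt
gt   = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 2, 1, 1], [2, 2, 0, 0]])
pred = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 1, 1, 1], [2, 2, 0, 0]])
print(per_class_iou(pred, gt, 3))  # background exact; top and skirt partially wrong
```

Averaging the per-class values gives the mean IoU commonly used to compare parsers.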
Fashion parsing is one of the most fundamental problems in fashion computing, as numerous high-level fashion tasks, such as virtual try-ons and fashion retrieval, are performed based on the output of fashion parsing. The issue of how to improve the efficiency of fashion parsing while maintaining accuracy is therefore the goal of many researchers. In addition, expanding the categories of items that can be parsed is also an exciting topic in this field.

Fashion style learning
Style is an overall semantic attribute of a fashion item, and is jointly determined by low-level attributes such as color, texture, and shape. People who wear different styles of clothing convey different temperaments. Style is also an essential factor for designers to consider in their designs. Fashion style learning allows the fashion-assisted design systems in the Metaverse to understand the characteristics of fashion styles in a similar way to humans. In general, fashion style learning can not only help designers to classify fashion styles, but can also predict fashion trends.
Style can be regarded as a semantic description of a fashion item. The classification of fashion styles remains challenging, as items with different fabrics, colors, and shapes may belong to the same fashion style. Early studies [60] used body detection and descriptions for fashion style classification. With the help of deep learning, it is now possible to directly use images of people as input for the task of style classification. Takagi et al. [61] created a fashion style dataset containing 13,126 images classified into 14 categories. They demonstrated the feasibility of fashion style classification through the direct use of a generic classification network. A joint classification and ranking network for weakly labeled data was proposed for style classification in [62], in which global feature extraction was performed on images to measure the similarity between the anchor image and both similar and dissimilar images, and feedback was passed to the classification network for style classification. Identifying clothing style based on local semantic features means that style classification is sensitive to the appearance of clothing items. To address this issue, Yue et al. [63] developed design issue graphs (DIGs) to provide global and semantic descriptions of clothing styles. However, the precise definition of fashion style remains an ongoing research problem. Although extant style classification datasets already contain many style categories based on the knowledge of fashion experts, they still cannot cover all styles, due to the rapid changes in fashion trends. Furthermore, since no commonly accepted classification criteria for fashion styles have been developed by fashion experts, the same look may be classified into several different styles. Hence, multi-label prediction of styles is also an important direction for future research on style classification.
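Because a single look can legitimately carry several style labels, multi-label style prediction reduces to thresholding independent per-style probabilities rather than picking one winning class. A minimal sketch, in which the style names, logits, and threshold are purely illustrative:

```python
import math

# Hypothetical style vocabulary; names are illustrative, not from a real dataset.
STYLES = ["casual", "formal", "street", "ethnic"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_styles(logits, threshold=0.5):
    """Multi-label prediction: every style whose independent sigmoid probability
    clears the threshold is kept, so one look can carry several labels at once."""
    probs = [sigmoid(z) for z in logits]
    return [s for s, p in zip(STYLES, probs) if p >= threshold]

print(predict_styles([2.0, -1.5, 0.3, -3.0]))  # → ['casual', 'street']
```

Contrast this with single-label softmax classification, which would force the ambiguous look into exactly one category.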
The prediction of fashion trends is another important application of fashion style representation. The aim in this case is to capture the visual style features of clothing and then to combine historical cross-domain data containing time series to predict future trends. Al-Halah et al. [64] were the first to propose a fashion style prediction system based on consumer purchase records and images. Later, Zhao et al. [65] designed a system called NeoFashion to predict trends for fashion designers. In similar research, Gabale et al. [66] predicted social media trends in India with an improved object detection model. Jin et al. [67] proposed an end-to-end LSTM encoding-decoding framework for the prediction of clothing trends in various price ranges. Fashion trend forecasting can be seen as a subtask of temporal forecasting. Unlike in ordinary temporal prediction tasks, non-temporal features such as customers' opinions, celebrity outfits, and popular social events may cause sudden changes in fashion trends. Determining how to represent celebrity effects and unexpected events in forecasting is a topic worthy of further discussion in the area of fashion trend forecasting.
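As a concrete baseline for this kind of temporal forecasting, a classical trend-aware smoother can be run on a style's historical popularity counts; learned models such as the LSTM framework above are typically compared against baselines of this sort. The series and parameters below are illustrative assumptions:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, steps=3):
    """Holt's linear-trend exponential smoothing: maintains a smoothed level and
    trend, then extrapolates the trend linearly for `steps` future periods."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(steps)]

# toy monthly popularity counts for one style
print(holt_forecast([10, 12, 13, 15, 18], steps=2))
```

Such a baseline captures a smooth trend but, as noted above, cannot anticipate sudden shifts driven by celebrity outfits or social events.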

Fashion design
In traditional fashion design, designers must spend a great deal of time carefully selecting colors, fabrics, and textures in order to draw a clothing tile image. Fortunately, computer-aided drawing tools can assist designers in creating clothing templates, which can greatly reduce their workload. However, clothing design requires a wealth of professional knowledge in practice. The Metaverse may lower this barrier to fashion design with the help of AI. It may be that designers and users will be able to specify a few constraints on products in the Metaverse environment, and the system will then instantly generate sketch samples that meet their expectations. Designers will then be able to add further details to these samples to produce a richly textured digital garment. Such a system could greatly improve the working efficiency of designers, and could enable users to create personalized products based on their preferences. Depending on the type of input, fashion design can be divided into single-modal and multi-modal processes.
The aim of single-modal fashion design is to transfer visual elements (such as colors, textures, etc.) from one fashion item to another. However, there are several difficulties with this approach at the transfer stage. Firstly, single-modal fashion designs require high-resolution textural details, and low-resolution fashion items cannot clearly illustrate the effects of style transfer. Secondly, a fashion design system needs to capture the boundaries of a texture filling accurately. In addition, some parts of fashion items do not need texture padding, such as buttons, zippers, etc. With the help of the controllable generation features of a GAN, many researchers have generated refined and user-controllable fashion items. The loss functions commonly used in single-modal fashion design include the feature loss, style loss L_s, pixel loss L_p, classification loss L_c, texture loss L_tex, color loss L_col, etc. These losses constrain the images generated by the GAN in terms of style, pixels, texture, etc., and ensure that the generated images do not deviate too far from expectations. TextureGAN [68] was the first method to allow the user to control the synthesis of fashion items from sketches and textures. A later system called FashionGAN [69] was designed based on an end-to-end virtual clothing generation network, in which simple textures and corresponding design sketches were utilized to achieve intelligent clothing design. A fashion generation framework called StyleGan was created by Sbai et al. [70], which was designed to generate realistic virtual clothing without input. Using a different approach, Jiang et al. [71] synthesized clothing images by blending them with textures of other items while preserving the global content of the clothing. Recently, Yan et al. [72][73][74] focused on the disentanglement of visual attributes, such as the textures and shapes of fashion images, in order to assist designers in accomplishing the task of fashion design. Current research in the field of single-modal fashion design focuses on the refinement, migration, and filling of visual features such as color, texture, shape, etc. Mapping fashion images to a latent space and transferring the mapping matrix can generate new fashion images with similar textures and colors. However, this method cannot edit a single attribute of a fashion item, such as its color or texture. Decoupling the visual attributes of fashion items remains a challenging topic in the area of single-modal fashion design.
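Two of the loss terms listed above can be sketched concretely: the pixel loss L_p as a mean absolute error between images, and the style loss L_s as a distance between Gram matrices of feature maps, combined with hypothetical weights. This is a generic sketch of such loss terms, not the exact formulation of any one cited system:

```python
import numpy as np

def pixel_loss(gen, target):
    """L_p: mean absolute error between generated and target images."""
    return np.abs(gen - target).mean()

def gram(feat):
    """Gram matrix of a (channels, pixels) feature map, capturing channel
    co-activations, as in neural style transfer."""
    return feat @ feat.T / feat.shape[1]

def style_loss(feat_gen, feat_target):
    """L_s: squared distance between Gram matrices of deep features."""
    return ((gram(feat_gen) - gram(feat_target)) ** 2).mean()

rng = np.random.default_rng(0)
gen, target = rng.random((3, 8, 8)), rng.random((3, 8, 8))  # toy 3-channel images
l_p = pixel_loss(gen, target)
l_s = style_loss(gen.reshape(3, -1), target.reshape(3, -1))
total = 10.0 * l_p + 1.0 * l_s  # hypothetical loss weights
print(float(total) > 0)
```

In a real system the feature maps would come from a pretrained network rather than raw pixels, and the weights would be tuned per task.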
Multimodal fashion design combines fashion images with other types of information, such as textual descriptions of fashion items, to generate corresponding fashion images. For example, Zhu et al. [75] focused on replacing the clothing of a person with a garment described in the form of text. Their method was implemented in two stages: in the first stage, human parsing was used to generate a reasonable human segmentation map, to maintain the shape of the body and the coherence with the text used to describe the human body, while in the second, a generator was tasked with generating clothing images based on the segmentation map and text descriptions. Zhang et al. [76] introduced three attention layers to the second stage of the network proposed in [75] to obtain more refined clothing details. The generation of clothing images directly from text descriptions is also a research focus in the domain of multimodal design. In particular, an enhanced attentional GAN (e-AttnGAN) [77] was proposed to accomplish the task of text-to-image generation. Another system called M6-UFC [78] leveraged multiple types of multimodal information in a unified framework to generate new images. The two main research paths in multimodal fashion design involve accurately establishing the mapping relationship between fashion features and text in different spaces and effectively integrating multimodal features, as these can help models to generate more refined fashion items and improve the overall consistency of the generated images.

Consumers
Consumers in the fashion community are expected to have a completely different shopping experience in the Metaverse than in the real world, due to the greater creativity of the Metaverse. The time and distance restrictions of traditional shopping are eliminated, and consumers can shop for impressive clothing at any time, and from anywhere. In the following, we review extant techniques that can be used in shopping scenarios in the Metaverse.

Virtual try-on
If a consumer finds a model in the Metaverse wearing a very attractive outfit, or the clothes in a store catch their eye, it is natural for them to wish to buy such clothing. Trying on clothes directly is an intuitive way for the customer to judge whether clothes suit them. Unlike in the real world, where clothes must be tried on in an offline shop, consumers can wear their favorite clothes at any time, and anywhere, in the Metaverse. Users can freely change their clothes in real time by selecting the clothes they want to try on, and virtual try-on technology is laying the groundwork for these exciting scenarios. The purpose of a virtual try-on is to check the appearance of the target clothing on the user without taking off the clothes that are currently being worn. A virtual try-on can be viewed as a special image-generation task in which images of a model wearing the target outfit are created, under limited circumstances. More specifically, a virtual try-on usually takes two images as input: one is a given model image m_t, which contains the given human body p_0 and the clothing c_0, and the other is a target clothing image c_t. The output of a virtual try-on system is an image m_g, in which the human body p_t is shown wearing the target clothes c_t and the body shape and pose of the model in the input image are preserved. Semantic information about the clothing and models is also fed into the system as one kind of supervision information. A basic virtual try-on framework is illustrated in Figure 6 (general framework for virtual try-on models).
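The final compositing step implied by this framework can be sketched as a mask-guided blend: inside the clothing mask the warped target garment c_t is pasted, and elsewhere the original person image m_t is kept. Array shapes and values below are toy assumptions:

```python
import numpy as np

def composite_tryon(body, warped_clothes, clothing_mask):
    """Mask-guided compositing: pixels inside the clothing mask come from the
    warped target garment; all other pixels keep the original person image."""
    mask = clothing_mask[..., None]  # broadcast the 2-D mask over RGB channels
    return mask * warped_clothes + (1 - mask) * body

body    = np.full((4, 4, 3), 0.2)             # toy "person" image
clothes = np.full((4, 4, 3), 0.9)             # toy warped garment
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0 # torso region to replace
out = composite_tryon(body, clothes, mask)
print(out[0, 0, 0], out[1, 1, 0])  # background keeps 0.2, torso becomes 0.9
```

Real systems predict both the mask and the warped garment with networks; this sketch shows only the deterministic blend at the end of the pipeline.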
In order to simplify the problem, the backgrounds of the clothing and human body images are usually clean. In a fashion shop, image pairs, i.e., a model wearing the clothes shown in a target product image, are easy to obtain. However, image triplets in which the same model holds the same pose but wears different clothes are difficult to collect; obtaining pixel-wise aligned images of each model in different clothes is therefore usually infeasible. The problem of using unpaired images can be handled in two ingenious ways, as shown in Figure 7. Many researchers regard the virtual try-on task as an image inpainting problem: they first mask the region of the body containing the clothes to be changed, to cover the semantic information of the clothing, and the masked image can then be repaired using the clothing item worn by the model for network training. However, since each person is only matched to one clothing item during the image reconstruction process, the performance of a virtual try-on model is usually limited, due to the generalization problem. When the target clothing and the clothing on the model have significantly different visual appearances, the virtual try-on system tends to be ineffective. In addition, a cycle-consistent approach can be used to train an end-to-end virtual try-on network.
The training process of this approach is shown in the bottom row of Figure 7. The clothes in the input image are replaced with the target clothes, and the clothes in the output image are then replaced with the original clothes from the input image. Nevertheless, it is still challenging to simultaneously generate the shape and texture of the clothes, human skin, and non-clothing contents using a cycle GAN. Based on the type of the target clothing image, the virtual try-on task can be divided into two categories: the target clothing in a fashion item image, and the target clothing in a human body image. In the first category, in the same way as in a traditional virtual try-on task, the system replaces the region of the clothing on the human body with the target clothing from an image that contains only a single fashion item with a clean background. In the second category, the clothing region in the input image is replaced using target clothes that are worn on another human body.
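The cycle-consistency idea can be sketched as follows: swap in the target clothes, swap the original clothes back, and penalize the L1 difference from the input image. The "generator" here is a toy stand-in function, not a real try-on network:

```python
import numpy as np

def cycle_loss(x, g, c_orig, c_target):
    """Cycle-consistency sketch for unpaired try-on training: applying the
    generator twice (target clothes, then original clothes) should recover
    the input image; the residual L1 error is the training signal."""
    forward = g(x, c_target)   # person now "wearing" the target clothes
    cycled = g(forward, c_orig)  # swap the original clothes back
    return np.abs(cycled - x).mean()

# toy "generator": blends the image toward a scalar clothing code (illustrative only)
toy_g = lambda img, c: 0.5 * img + 0.5 * c
x = np.ones((2, 2)) * 0.4
print(round(float(cycle_loss(x, toy_g, c_orig=0.4, c_target=0.8)), 4))  # → 0.1
```

A perfect generator would drive this loss to zero; the nonzero value here shows the toy blend does not invert itself exactly.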
In practice, transforming a real garment from a shop into a photo-realistic garment fitted to a reference image of a person is an important subtask of a virtual try-on. In response to this issue, many researchers have focused on generating natural, realistic transferred garments and retaining finer textures. They have usually warped the input clothing to align it with the image of the customer using two general methods: a geometric transformation and a warping module. Geometric transformation exploits spatial information to make the deformed clothes more realistic. The thin-plate spline (TPS) [79] is a general geometric transformation method for garment warping. It has been proven to be an effective coordinate transformation model in many computer vision tasks, such as object recognition, virtual try-ons, etc., and is a basic function used to map between corresponding coordinates. The clothing from an in-shop image is then geometrically transformed by TPS to produce the warped clothing image. VITON [80] was the first system to exploit TPS for a virtual try-on task, and deformed in-shop clothing into warped clothes with a composition mask. A neural network was also used to learn the transformation parameters of TPS in CP-VITON [81]. Later, Fenocchi et al. [82] introduced self- and cross-attention operations to the warping module. They aligned the refined representations of a person and an in-shop garment using two-branch cross-modal attention blocks. In a virtual try-on framework, a generator is typically employed to synthesize the final results, in which a model wears the target garment. The U-net architecture [83] is the most widely used type of generator for this task, as it directly shares features between different layers. However, the basic U-net architecture [83] suffers from blurred textures and loss of detail in the generated image of the person. To address these problems, several refinement strategies have been adopted to improve the quality of the final results. For example, realistic details from the deformed clothing have been exploited by a network to render blurred regions [80]. In the same vein, Ge et al. [84] used warped clothes, human pose estimation, and reserved regions on the human body as input. They combined Res-UNet with residual connections to preserve the details of the deformed clothes and to generate realistic fitting results.
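A minimal interpolating TPS can be implemented directly: fit radial-basis weights with kernel U(r) = r² log(r²), plus an affine part, so that every source control point maps exactly to its target. This is a generic sketch of the transformation itself, not the learned TPS modules of [80] or [81], where the control points are predicted by a network:

```python
import numpy as np

def _kernel(d2):
    """TPS radial basis U(r) = r^2 log(r^2), with U(0) = 0, given squared distances."""
    logs = np.log(d2, out=np.zeros_like(d2), where=d2 > 0)
    return d2 * logs

def tps_fit(src, dst):
    """Solve the standard TPS linear system so f(src[i]) = dst[i] exactly.
    Returns (n+3, 2) parameters: n RBF weights followed by 3 affine terms."""
    n = len(src)
    K = _kernel(((src[:, None, :] - src[None, :, :]) ** 2).sum(-1))
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(A, b)

def tps_apply(params, src, pts):
    """Evaluate the fitted mapping at arbitrary 2-D points."""
    U = _kernel(((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1))
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ params[:len(src)] + P @ params[len(src):]

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
dst = src + np.array([0.1, 0.0])  # shift all control points to the right
params = tps_fit(src, dst)
print(np.allclose(tps_apply(params, src, src), dst))  # → True (exact at control points)
```

In garment warping, the source points lie on the flat in-shop garment and the targets on the person's body; the fitted mapping is then applied to every garment pixel.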
A virtual try-on task can also take the target garment from an image of a person rather than from a clean product image. This task focuses on transferring the clothes worn in the original image C_o onto an arbitrary model image m_a. However, this task gives rise to different challenges compared to inputting a target garment from a fashion item image with a clear background. For example, identifying and extracting regions of clothing in the input model image m_a becomes essential for a natural result. Due to the differences between the person in the original image C_o and the model image m_a, the problem of aligning the poses of the two bodies is also challenging. In addition, the seamless synthesis between the desired clothing in C_o and the model in the target image m_a is also a factor affecting the success of the virtual try-on task. In view of these issues, researchers have attempted to handle arbitrary poses, clothing extraction, and other challenging problems by developing frameworks with multiple components. For example, Wu et al. [85] proposed the M2E-Tryon network to transfer clothes from an original image to an arbitrary person. Since the clothing in the input image contains the pose information of the original person, the pose alignment module is a critical component in which the pose of the model is aligned to that of the input person. Similarly to the pose transfer module, the pose alignment component aims to modify the viewpoint and the pose of the human in an image. Dense pose conditioning [86] and human body segmentation [87] are often used to generate pose images in the task of pose transfer. Moreover, a body fitting module [85] and a texture module [87] are widely used to facilitate the task of garment transfer learning.
The focus of most research on target clothing in fashion item images and human body images involves warping clothing and splicing it with an image of a human body. However, most current research studies have considered the VITON dataset, which only contains a single type of clothing, and the performance of these systems on a wide variety of garment types is still unpredictable. In addition, the issue of how to alleviate the dependence of virtual dressing tasks on preprocessing, such as fashion segmentation and pose estimation, is also a topic worthy of further research.

Fashion recommendation
Shopping in the Metaverse can overcome space constraints, and can allow customers to enjoy an immersive shopping experience at any time and from anywhere. Since a fashion store in the Metaverse can offer countless fashion products, it represents a paradise for fashionistas who enjoy shopping. However, customers who have difficulty in choosing or who have less time for shopping will have trouble in selecting suitable products when faced with so many, even if these items can be displayed based on their attributes through a fashion retrieval process. To address this issue, fashion recommendation techniques can be adopted to alleviate the burden of choosing products for customers. This type of system can actively recommend suitable products, acting as a shopping guide during the customer's shopping process. Due to the real-time interaction between the customer and the system in the Metaverse, the shopping experience can be greatly improved.
As a specific type of the more general recommendation system [88], fashion recommender systems have attracted considerable attention from academic researchers and industrial practitioners. The aim in this case is to automatically select clothing that will meet the consumer's preferences or match the customer's needs according to their personal information, dressing scenes, and other information. Compared with a general recommendation system, the task of fashion recommendation has the characteristics of both visual priority and local priority. This means that traditional, general recommendation methods may not be ideal for carrying out fashion recommendation tasks in a straightforward way.
A fashion item with a good design has a strong visual expression that is recognized by customers. A recommendation system that considers both the appearance of a product and the user's consumption habits can deliver suggestions that match the customer's preferences. Determining how to represent the visual features of products and how to add them to the recommendation system as essential reference factors are critical aspects of the fashion recommendation task. A CNN framework is a typical means of extracting the visual features of items. As they have excellent feature extraction ability, deep CNNs such as ResNet [89], Caffe [90], etc. are widely used to extract high-level visual features of items. He et al. [91] first introduced the visual appearance of items into a preference predictor. Their network was fitted with an additional layer that could extract the relevant visual features and latent dimensions to provide recommendations.
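A visually aware preference score of the kind described above can be sketched as a latent user-item term plus a visual term, in which a learned matrix projects the CNN features of the item image into a visual preference space. Names, dimensions, and values below are illustrative assumptions, not the exact formulation of [91]:

```python
import numpy as np

def visual_score(gamma_u, gamma_i, theta_u, E, f_i, beta_i=0.0):
    """Preference score = item bias + latent user-item interaction + a visual
    term theta_u . (E f_i), where E embeds raw CNN features f_i into the
    low-dimensional visual preference space."""
    return beta_i + gamma_u @ gamma_i + theta_u @ (E @ f_i)

rng = np.random.default_rng(1)
gamma_u, gamma_i = rng.random(8), rng.random(8)  # latent user / item factors
theta_u = rng.random(4)                          # user's visual preference factors
E = rng.random((4, 16))                          # embedding of CNN features
f_i = rng.random(16)                             # CNN feature of the item image
print(float(visual_score(gamma_u, gamma_i, theta_u, E, f_i)) > 0)  # → True
```

Ranking items by this score for a fixed user yields the personalized, appearance-aware recommendation list discussed above.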
In the field of fashion recommendation, the style of a fashion item also has a significant impact on customer preference. Liu et al. [92] proposed a method called DeepStyle to characterize user preferences by learning the style features of items. Style features were then integrated into the widely used BPR [93] framework to generate fashion recommendations.
There is a large body of literature in the domain of fashion recommendation on how to recommend appropriate fashion items that create outfits with existing clothing. This problem can be summarized as a compatibility estimation task, which is introduced later in this section. However, scenario-oriented and explainable fashion recommendations, among others, are also indispensable aspects of the task of fashion recommendation. Scenario-oriented fashion recommendation suggests suitable outfits for a user based on certain events that the user needs to attend. Liu et al. [94] devised a "magic closet" system that suggested the best matching outfits for a special occasion. Zhang et al. [95] designed a clothing recommendation system that was able to recommend clothing for travelers based on the relevance of the clothing to the destination. When customers receive recommendations from a system, they may also want to know why these clothes were recommended for them. The task of explainable fashion recommendation is more complicated, as it usually involves multiple forms of domain knowledge, such as user attributes, regional culture, computer vision, etc. Chen et al. [96] used an attention model to learn the regions that attracted the customers' attention. They claimed that this method could visually illustrate the reasons for recommending a garment by highlighting the key regions of an image. Using another approach, Lin et al. [97] explained system-based clothing recommendations by analyzing customer reviews. Tangseng et al. [98] proposed a method of quantifying the impacts of different attributes of clothing. They represented the garment in an image using human-interpretable features, and provided the reason for picking an item of clothing in terms of its most influential features. Zhou et al. [99,100] introduced outfit generation frameworks to automatically synthesize compatible fashion items when given an extant item. Collecting the highly correlated factors affecting the customer's purchase intention and adding them to the recommendation network is a critical step in fashion recommendation. However, to create a recommendation model that performs well on the market, visual similarity must not be considered alone; the regional culture, the personal attributes of target customers, and social networks are all factors that need to be taken into account in real-life applications.

Fashion retrieval
In the real world, customers may find it tiresome to select the clothes they want from a store that is full of merchandise. However, customers need not encounter this irritating shopping experience in a Metaverse store. When faced with a range of countless products, customers can quickly filter the products based on their attributes at any time, allowing them to pick out suitable products. To support this, the task of fashion retrieval involves methods of quickly and accurately searching for a specified item in a massive dataset.
The aim of fashion retrieval is to return accurate and relevant fashion products in response to a query by a customer, thus increasing the convenience of purchasing fashion products. A retrieval system usually retrieves data from the dataset that are similar to the query item based on a comparison of visual similarity. Depending on the scenarios in which the query object and the returned object are located, this process can be divided into intra-scenario and cross-scenario fashion retrieval. Intra-scenario image retrieval searches for similar fashion items from a dataset whose images have the same scenario as the query images. In contrast, in the cross-scenario fashion retrieval task, the scenario of the query fashion images is often different from that of the returned fashion images. For instance, users can search an online shopping image dataset for fashion items similar to those photographed in daily life, or search online retail fashion images for items similar to those in street photographs. Non-deep-learning fashion retrieval methods can be implemented in two stages. The first stage involves locating and segmenting the region containing a query garment in an image. In the second stage, artificially constructed visual feature representations of segmented garments are captured to enable an image search. Liu et al. [101] first proposed a solution to the issue of cross-scene fashion retrieval. An occasion-oriented fashion retrieval approach was also proposed, in which the low-level visual features of clothing and high-level occasion category features were fused with mid-level clothing attributes. A feature representation that was able to characterize the clothing appearance well, using a pose-dependent approach, was used for fashion retrieval [102]. These feature representation schemes can facilitate the quantitative analysis of cross-domain clothing image similarity.
Thanks to their powerful feature extraction capabilities, deep learning methods have become the most common solution to the problem of fashion retrieval. A deep network is used to model the similarity of the garments, which is used to determine whether the clothes in two images are the same based on a set of designed rules. The basic pipeline for these deep learning methods is illustrated in Figure 8.
In intra-scenario fashion retrieval, the similarity can be calculated without an intermediate image between the query clothes and the candidate clothes, as they reside in the same scenario. As shown in Figure 8 (Input (a)), the input can be represented by paired data samples {(x_p^1, x_p^2), (x_n^1, x_n^2)}, where x_p^1 and x_p^2 are two positive samples representing the same or similar fashion items, and (x_n^1, x_n^2) is a negative data pair representing unmatched fashion items in two images. The similarity learning of binary sample pairs can be regarded as a binary classification task, in which the aim is to minimize the distance between positive sample pairs and simultaneously maximize the distance between negative sample pairs. In view of this, Kiapour et al. [103] employed a CNN pre-trained on ImageNet to extract feature representations from a query bounding box and the clothing region in shop images. In addition, Kinli et al. [104] proposed densely-connected capsule networks to search for in-shop clothing.
The difference between cross-scenario and intra-scenario fashion retrieval lies in the scenarios of the query and candidate images. In practice, it is challenging to handle the discrepancies between fashion items in different scenarios. One commonly used strategy is the use of domain adaptation techniques, in which triplet embeddings are adopted to bridge the discrepancies between domains. As shown in Figure 8 (Input (b)), a triplet {(x_a, x_p, x_n)} is fed into a deep network to map the samples onto a shared space. A triplet sample consists of an anchor x_a, a positive sample x_p, and a negative sample x_n. Samples with matching labels are regarded as positive pairs, and those with mismatched labels as negative pairs. Using this approach, Huang et al. [105] proposed a dual attribute-aware ranking network (DARN), which consisted of two sub-networks for feature learning. A sub-network was designed for each domain, and semantic attribute learning was exploited for feature representations. The two sub-networks were connected by feeding the features extracted from each into a triplet loss function.
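The triplet objective used in such domain-adaptation schemes can be sketched directly: the anchor is pulled toward the positive and pushed away from the negative by at least a margin. The embeddings below are toy values standing in for network outputs:

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Margin-based triplet loss: zero once the anchor-negative distance exceeds
    the anchor-positive distance by at least `margin`, positive otherwise."""
    d_pos = np.linalg.norm(x_a - x_p)
    d_neg = np.linalg.norm(x_a - x_n)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # anchor: query garment (e.g., street photo)
p = np.array([0.1, 0.0])   # positive: same garment in the shop domain
n = np.array([1.0, 1.0])   # negative: a different garment
print(triplet_loss(a, p, n))  # → 0.0 (already separated by more than the margin)
```

Training on many such triplets drawn from both domains is what pulls the street and shop embeddings of the same garment together.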
Fashion retrieval can help customers to select coordinating fashion items from a massive dataset of fashion items in the Metaverse based on the attributes and characteristics of the items. Identifying deformed fashion items and mapping cross-domain attributes are current research hotspots in the field of fashion retrieval. Exploring the controllability provided by attribute disentanglement and the retrieval of unlabeled fashion items is also a worthwhile avenue for future work.

Fashion compatibility
Many people like to ask friends to accompany them when shopping, to help them evaluate the clothes they choose and provide suggestions. However, this may not be possible for someone who has difficulty in finding companions for shopping. Fortunately, there is no such barrier for users shopping in the Metaverse. The shopping guide offered by a Metaverse store can actively score the clothes chosen by a user in real time, and provide suggestions when the user has difficulty in selecting a match.
The aim of a fashion compatibility system is to estimate how well different types of fashion items match. Learning the compatibility between fashion items forms the basis for many advanced fashion tasks, and represents a challenging task in itself. In practice, it is undesirable to calculate fashion compatibility based solely on visual similarity, as the shapes of different fashion items may be quite different, and the visual properties of two harmonious fashion items, such as their colors and textures, are not necessarily the same. Researchers have developed several models in which harmonic matching is inferred in the fashion domain to enable compatibility learning. Compatibility semantics are usually modeled and characterized based on the deep features of fashion items. Mainstream methods embed fashion items into an underlying representation of the fashion domain through different embedding strategies, and use this underlying representation as the basis for compatibility calculations. Due to their powerful feature extraction ability, deep learning methods for fashion compatibility can map fashion items into a deep fashion space, and can learn compatibility based on distance metrics in the mapped space. Hence, the problem of fashion compatibility can be viewed as a specific type of metric learning in which the compatibility of fashion items is determined by computing the distance between their embedding vectors. A further focus for research involves compatibility learning for outfits composed of multiple garments.
The goal of metric learning is to learn a measure of the similarity between two items. In this approach, a pair of items is treated as two feature points x and y in the deep learning space, and a distance function d(x, y) is employed to measure the distance between them. A fashion compatibility system leverages metric learning to learn an embedding space in which the distances between compatible (positive) items are smaller than those between non-compatible (negative) items. McAuley et al. [106] were the first to introduce low-rank embeddings to metric learning. Chen et al. [107] added a mixed-category metric to the scheme in [106], and solved the problem of fashion compatibility by extending the triplet neural network to accept multiple instances in an iterative approach. Sun et al. [108] employed the high-level semantic and visual features of fashion items to learn fashion compatibility. To address the difficulty of accessing fashion datasets for supervised learning, a semi-supervised method for learning visual representations of fashion compatibility was proposed in [109]. An encouraging finding was that this method achieved performance equivalent to that of fully supervised methods. Item-to-item research in metric learning is relatively abundant [110]; however, limited research has been done on item-to-set metrics. Zheng et al. [111] proposed a general item-to-set metric for the task of fashion compatibility that used neighboring importance and intra-set importance to filter out instances that were far away from a set.
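To make the distance-based formulation above concrete, the following minimal sketch (a plain-Python illustration, not the implementation used in any of the cited works) computes a hinge-style triplet loss, which penalizes an embedding unless a compatible (positive) item lies closer to the anchor than a non-compatible (negative) item by at least a chosen margin:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over item embeddings (lists of floats):
    zero when the positive item is already closer to the anchor than
    the negative item by at least `margin` in squared distance."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

# Toy 2D embeddings: the positive item sits near the anchor and the
# negative item far away, so the margin constraint is satisfied.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]))  # 0.0
```

During training, this loss would be minimized over many (anchor, positive, negative) triplets so that compatible items cluster together in the learned space.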
In outfit compatibility learning, multiple garments are combined into sets to enable compatibility prediction. The compatibility of an outfit is evaluated based not only on the visual similarity and semantic information of the fashion items, but also on the types of fashion items that are necessary to compose an outfit. For example, Han et al. [112] addressed the task of multi-garment compatibility learning by exploiting a bidirectional LSTM model [113] in which an outfit was viewed as a sequence and one item was consumed at each time step. Hsiao et al. [114] designed a capsule wardrobe that could automatically form outfits from candidate items in a wardrobe to create recommendations. Using another approach, Zhang et al. [115] argued that color plays a significant role in clothing compatibility, and used a graph model to represent multiple garments. Pang et al. [116] divided compatibility into three levels and increased the interpretability of fashion compatibility predictions through the use of gradient penalties. In addition, Sarkar et al. [117] designed a system called OutfitTransformer to capture the global representation of an item set, and trained the network using a classification loss.
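As a simple illustration of scoring a whole outfit rather than a single pair, the naive baseline below (a hypothetical aggregation, far simpler than the LSTM, graph, and Transformer models cited above) rates an outfit by the mean pairwise distance between its item embeddings, with a smaller score indicating a more internally compatible set:

```python
def euclidean(x, y):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def outfit_compatibility(embeddings):
    """Score an outfit by the mean pairwise distance between its
    item embeddings; a lower score means a more compatible set."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(euclidean(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)

# A tightly clustered outfit scores lower (more compatible)
# than a scattered one.
tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
loose = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
assert outfit_compatibility(tight) < outfit_compatibility(loose)
```

Real outfit models also condition on item categories and set-level context, which this pairwise average deliberately ignores.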
Fashion compatibility involves calculating the overall harmony of an outfit. Obviously, it is insufficient to treat clothing as a sequence and to focus only on the relationships between items, as this approach overlooks the overall harmony of the item set. Fashion compatibility learning has many important application scenarios in both the physical world and the Metaverse. It can help designers and consumers to select clothing, and can provide quantitative matching assistance for high-level fashion technologies.

Future Prospects and Challenges
As a field that has emerged over the past year, fashion in the Metaverse is attracting increasing attention from both academia and industry. Many fashion companies have already invested resources in the Metaverse, for example by building virtual spokespersons and running virtual catwalks. Nevertheless, many fashion application scenarios in the Metaverse remain unexplored. This section envisions some novel scenarios involving the Metaverse and highlights the current challenges in this domain.

Overcoming the physical constraints on fashion items
Beauty and comfort are the two most important factors in fashion clothing design. However, it is difficult to achieve both simultaneously due to physical constraints such as gravity and the properties of clothing fabrics. For example, when designing a suit, designers add shoulder pads to widen the shoulders and make the body appear tall and straight. However, this limits the movement of the wearer, and prevents them from raising their arms comfortably. Fortunately, it is possible to overcome the constraints imposed by the physical properties of materials in the Metaverse. Garments in the virtual world are simply sets of data, and can therefore remain comfortable during any activity. In addition, whereas in the real world a consumer must wear a heavily padded coat to stay warm, in the Metaverse clothing can automatically regulate body temperature, making it possible to wear lighter-looking clothing in cold places. Clothing in the Metaverse therefore has the potential to overcome the physical constraints of the real world. Under these conditions, designers can boldly use their imagination to create astonishing fashion items that could not be made in the real world.

Convenience of fashion design in the Metaverse
Although designers may be inspired by many things in the real world, they cannot carry out an objective evaluation of clothing at the design stage; they typically first need to create a sketch of a garment and produce a physical sample before they can objectively evaluate it. Obviously, this process wastes a great deal of the designer's time. In addition, due to the limitations of printing and dyeing technology, the clothing may not be able to be dyed in the exact color the designer wants.
Fortunately, every stage of the clothing design process is facilitated in the Metaverse. Firstly, designers working in the environment of the Metaverse are able to easily obtain items that can inspire them; famous designers do not need to spend several months traveling to find inspiration, as in the real world. In addition, designers can directly put clothing on a virtual model for evaluation at the design stage, which eliminates the step of producing a sample garment in the real world. Secondly, clothing designed with computer tools in the Metaverse will not show the deviations that occur in the real world, such as color variations and discrepancies in garment shape. In addition, a modular approach to the design of clothing can be applied in the Metaverse: designers can employ a computer to preview and evaluate parts of the clothing before the overall design is complete. The Metaverse can therefore shorten the design process and lower the threshold of professional experience required for fashion design, allowing more consumers to join in the design process in an interactive way.

Shopping in the Metaverse
The Metaverse is a virtual world built on networks that can eliminate the physical distances that exist in the real world. Today, consumers generally buy fashion items in two ways: the first is offline shopping, while the other is online shopping with delivery by courier. Both of these methods have drawbacks. Offline shopping requires consumers to spend a lot of time traveling, while online shopping may mean that consumers buy unsuitable items, and several days may be needed to receive them. In contrast, shopping in the Metaverse offers the advantages of both. Consumers can select and try on their favorite fashion items directly in a Metaverse fashion shop, which allows them to view the fitting in real time. A virtual shopping guide can provide customers with clothing evaluations and recommendations at any time. In addition, consumers can edit the size of the clothes according to their avatar's body, to achieve the most suitable fit. Finally, when consumers have chosen clothing that suits them, they can add it directly to their virtual wardrobe without waiting for delivery, as in the real world. In this way, shopping for clothes in the Metaverse will perfectly combine the advantages of online and offline shopping in the real world, providing users (or 'meta persons') with a more comfortable shopping experience.

Expressing emotions and personality through clothing
In the real world, a fashionista can express their mood through the style and color of the clothing they are wearing. For example, people in a good mood tend to wear brightly colored clothing. However, due to the limitations arising from the physical properties of fabric, the color and style of clothing cannot be changed according to the mood of the wearer. In contrast, clothing in the Metaverse can overcome these limitations. The Metaverse allows designers to add variable properties to clothing, such as images and colors, so that consumers can freely edit clothing elements based on their thoughts and emotions. For example, consumers can change their clothing to a warm color to indicate to others that they are in a good mood, or to a cool color to signal that strangers should keep their distance. As a result, clothing in the Metaverse can convey more information and can be used to express the user's personality anywhere, at any time.

Expanding the boundaries of fashion in the Metaverse
In the real world, the human body is the main vehicle for fashion items, and the physical features of the human body are among the most important elements to be considered in clothing design. In the Metaverse, however, avatars can be of various types, and may be human-like, animal-like or even monster-like, meaning that fashion items designed for the real world may no longer be valid in the Metaverse. For example, clothing designed to cover the private parts of humans would no longer work for avatars that do not have private parts, such as puppies, Godzilla, or aliens. Hence, the Metaverse would significantly enlarge the range of fashion items, enrich the design ideas for fashion items, and expand the boundaries of fashion.

Fine modeling of the human body and fashion items
The great attractiveness of the Metaverse lies in the fact that it depicts a world completely different from the real one, in which users can immerse themselves to experience an utterly different life. This immersive experience can be realized through the fine modeling of hundreds of objects in the Metaverse. Currently, fine modeling of fashion items and the human body relies on 3D scanning and manual modeling by artists with specialist knowledge, making it impractical to model thousands of objects using these time-consuming methods. Although the use of view-based 3D reconstruction methods can improve the efficiency of this process, the generated models have low accuracy, creating a less immersive experience for users of the Metaverse. Hence, the development of low-cost fine modeling methods for humans and fashion items is a significant avenue for future work.

Simulation of 3D clothing fabrics
Apparel fabric is an essential factor in characterizing the category and style of clothing. The physical properties of fabrics make garments with the same style visually different; for example, a silk shirt is softer than a cotton shirt, and the details of their textures also differ. Many researchers focus on modeling the texture of clothing materials, but ignore the simulation of the stiffness of the fabrics. Poor simulation of fabric stiffness can make an avatar's clothing deform and swing rigidly when the avatar moves, and can cause unnatural mapping of clothing onto bodies of different shapes. The accurate simulation of clothing fabrics in the Metaverse is another challenging topic.

Issues of fashion copyright in the Metaverse
Digitization makes it easy to replicate fashion items in the Metaverse, and it is reasonable to expect that the illegal copying and counterfeiting of fashion items will become more widespread there. Hence, strengthening the copyright protection of fashion items in the Metaverse is an essential topic. In addition, the question of who owns the copyright of a given style of fashion item in the physical and virtual worlds also needs to be discussed. For example, designer A may design a famous sweater S in the physical world, while designer B may digitize this sweater into the Metaverse; the copyright ownership of sweater S in the Metaverse then becomes controversial. The definition and protection of copyright for fashion items in the Metaverse is a topic that needs to be fully discussed and regulated.

Conclusion
In this paper, we have presented a comprehensive survey of the two main elements of fashion in the Metaverse: digital virtual humans and fashion items. In our study of digital virtual humans, we focused on investigating methods of generating 3D avatars, which can reduce the cost of creating digital bodies in the Metaverse. In addition, we reviewed fashion learning and analysis methods that could assist both fashion designers and consumers in the Metaverse. We also envisioned certain fashion scenarios in the Metaverse and discussed several important open issues associated with its future development. We believe this survey will be instructive for both academics and industrial practitioners, and will shed some light on the study of fashion tasks in the Metaverse.

Fig. 1 Research topics associated with fashion in the Metaverse and a taxonomy of these techniques.

Fig. 2 Outlets publishing articles on fashion in the Metaverse.

Fig. 3 Number of publications in each year on fashion in the Metaverse.

Fig. 4 Pipeline for single-view human body reconstruction.