1 Introduction

In the digital world of fashion, personalized avatars for clothing try-on and online shopping appeal to most people. With the tremendous advancement of the Internet and smart mobile devices, personalized avatars play a key role in fashion virtual reality applications, such as virtual try-on [1, 2], virtual fashion recommendation [3, 4], and immersive online shopping [5]. According to a recent study [5], the personalization of avatars significantly increases self-identification and ownership for users (see Fig. 1), because facial information, such as face shape, skin color, and hair color, is a pivotal aesthetic reference in the selection of clothing and makeup [4]. Nevertheless, personalized avatars are rarely used in most virtual try-on systems, because creating them requires either special and costly 3D scanning equipment or computationally expensive methods [1, 2, 5,6,7,8,9]. Although some recent image-based reconstruction methods, such as contour-based shape reconstruction [2, 10] and landmark-based pose reconstruction [8, 9], have been reported for personalizing human models, they mainly focus on body shape modelling, with the facial part either neglected [2, 8,9,10] or only coarsely reconstructed [11].

Fig. 1
figure 1

Personalized avatars increase individual identity and body ownership (source images from the internet)

Compared to other common 3D objects, reconstructing perceived faces must address additional challenges. First, unlike common objects that contain mostly low-level visual features, human faces have many high-level visual features, which are difficult to recover [12]. Second, a face conveys multiple facial attributes, such as identity, gender, age, and expression, whereas common objects mainly present one key attribute [13]. The reconstructed face models should therefore accurately restore these facial attributes. To alleviate the problem, some studies [7, 14] separated the reconstruction of the head and body: they first reconstructed personalized faces [15, 16], and then integrated the reconstructed 3D faces with body models reconstructed by other methods or taken from other templates [9]. For face reconstruction, the 3D morphable face model (3DMM), a statistical model derived from 3D scans by means of principal component analysis (PCA), is a readily available and well-accepted approach [15, 17]. By adjusting a set of parameters, the 3DMM can be deformed to represent different 3D face shapes and appearances. Researchers have obtained personalized avatars by combining morphable face models with head models and then with human body models. For example, Fang et al. [7] replaced the head of the human model with the Basel Face Model (BFM) [15]. These combined models, however, still exhibit inconsistent geometric structure [17]; for instance, existing methods cannot handle the case where the head, eyes, and hair are isolated meshes. Moreover, the reconstructed facial appearance of these models has limited representation ability [18] when the facial texture and appearance are also learned statistically by PCA. For example, the BFM [15] was statistically learned from only 200 identities. The scarcity of 3D data and the Gaussian nature of the statistical model limit the ability of such a texture model to represent individual identity.

Another important reason impeding the widespread commercial adoption of personalized avatars is the topology diversity of the avatar models used in different applications and platforms. In the present work, a method is developed to separately transfer reconstructed facial shapes and appearances to full-body avatars. Although the present study focuses only on realistic face shape and facial appearance reconstruction for avatar personalization, the proposed method can be applied to avatar models with diverse body shapes and different topologies, making it work seamlessly with other body shape customization methods [2, 8,9,10]. The key contributions of this paper are as follows:

  • An inexpensive and topology-free approach is developed for transferring face reconstructions based on single-view images to full head models. This method circumvents the need for specialized equipment and overcomes the limitations associated with different head topologies.

  • A method is developed to transfer the reconstructed texture from face to full body of the avatars, enriching the details of the personalized face appearance and thus enhancing the visual fidelity and realism of the resulting human avatars.

  • The method is integrated and applied to virtual try-on systems, preserving individual identities, and improving the realism of the try-on experience for users. This contributes to a more accurate representation of how clothing and fashion items would appear on an individual, enabling users to make informed decisions and fostering a more satisfying and engaging virtual shopping experience.

The rest of the paper is organized as follows. The relevant research concerning virtual avatars and personalized face reconstruction is first discussed in Sect. 2. Next, the proposed method is detailed in Sect. 3 and supported with experimental evaluation and analysis in Sect. 4. Lastly, Sect. 5 concludes this study and provides recommendations for future work.

2 Related work

2.1 3D human body reconstruction

Research on 3D human body reconstruction has mainly concentrated on reconstructing human pose and body shape, with the facial region mostly being neglected or reconstructed only roughly. For instance, the skinned multi-person linear model (SMPL) [8, 9] was proposed as a statistical full-body model to enable rapid pose and shape reconstruction for computer games and animation. Ji et al. [10] trained a dual neural network to reconstruct a 3D human body from a frontal-view and a side-view silhouette image; it can reconstruct the body from sparse features but still lacks facial detail. Other researchers [19, 20] introduced depth images into deep-learning-based 3D human body reconstruction, taking advantage of the additional spatial information that depth images provide compared with RGB images. Nevertheless, the resolution of depth images is often limited, so they cannot provide sufficiently detailed information for face reconstruction. Subsequently, Romero et al. [21] and Pavlakos et al. [11] supplemented SMPL with hand and face models through additional joints and animation. Furthermore, Joo et al. [14] introduced a fusion model merging the SMPL body with the head model of FaceWarehouse [16] to create a comprehensive body model with more detailed facial expression and animation. Other researchers utilized deep neural networks to enhance the facial animation capability of human models [22,23,24,25], but these still lack the details and individual identities of human face reconstructions.

Previous research on human avatar reconstruction mainly focused on body shape and pose, whereas facial appearance is also important for certain sectors of the fashion industry, such as virtual try-on [24], make-up recommendation [3, 4], and immersive shopping [5]. Considering the difficulty of creating human avatars with strong individual identities, some methods bypass facial reconstruction altogether. For example, Yuan et al. [6] suggested a mixed reality virtual clothing try-on system, which combined the rendered body model with the real face image of the user and thus enabled personalization of fashion avatars. Nevertheless, the outcome is neither lifelike nor consistent between face and body. Other researchers proposed pixel-aligned reconstruction of the clothed body from 2D images with deep neural networks in voxel space, such as PIFu [26] and PIFuHD [27], allowing simultaneous body and face reconstruction. Their work is, however, based on a voxel representation with rather low resolution owing to its high storage and computing cost. Furthermore, voxel-based models are less compatible with most industrial computer graphics pipelines than mesh-based models. Hence, the models reconstructed by PIFu are generally not suitable for most fashion applications.

Recently, in order to personalize mesh-based human avatars, Ma et al. [28] reconstructed a holistic human model based on SMPL, training separate models to reconstruct the face, hands, and body. This method, based on the standard SMPL model, achieves only coarse reconstruction and is mainly suitable for animation purposes. Fang et al. [7] proposed a weak local registration method for combining a customized BFM face model with a human body model, as if wearing a “mask”. This method can customize faces for human avatars with different topologies, but it still has drawbacks; for example, when the reference landmarks of the facial region of the body model are not precisely defined, the face model becomes distorted. Ploumpis et al. [29] combined two different models by transferring their principal components through a regression matrix, which enabled a natural transformation between different topologies. Inspired by this work, the present study develops a method for transferring a reconstructed customized facial shape and appearance to full-body avatars with various topologies, providing a natural personalized face representation for human avatars.

2.2 3D morphable model in face reconstruction

To further personalize facial representation in human body reconstruction, the reconstruction of facial shape and appearance is crucial [4, 5]. As discussed in the Introduction, the 3DMM is a fundamental and effective approach to 3D face reconstruction; methods have been developed using single images as input as well as other types of input, such as RGB-D depth data [30], multi-view image sequences [31, 32], or video [33]. Blanz and Vetter [15] proposed the first statistical 3DMM, trained on a dataset of 100 males and 100 females using the PCA method. Owing to the limited data, however, the representation ability of this model is rather limited. To extend the representation of the 3DMM, Cao et al. [16] built a 3D morphable head model with data collected from 150 individuals, each with 20 different pose angles. Considering that more 3D data gives the 3DMM face model greater representation ability, Booth et al. [34] built a head model from a large 3D head dataset. Zhang et al. [35] proposed a 3D morphable head model trained on 3D scans of individuals differing in age and sex. Yang et al. [36] collected a large head scan dataset of Asians and built a full-head model from it. Li et al. [17] built a full-head model that contains three head joints for shape and pose deformation. Dai et al. [37] proposed a full-head model with texture trained on a large 3D dataset. Yenamandra et al. [38] proposed a full-head model containing face shape, texture, and hair, although the resolution of the model is low, and the reconstructed facial and hair silhouettes are unrealistic.

2.3 Reconstruction of human facial shape and texture

2.3.1 3D face shape reconstruction

Facial shape reconstruction methods can be subdivided into data-driven and optimization-based methods. For the data-driven approach, deep neural networks perform outstandingly in image processing owing to the abundance of large datasets [39]. Nevertheless, annotated 3D face data are scarce, which makes supervised deep learning difficult for 3D face reconstruction. Hence, self-supervised learning methods [40,41,42,43] were introduced to reduce the requirement for annotated 3D data; they combine deep learning with 3DMM-based optimization. The outputs of these methods are, however, still restricted by the statistical characteristics of the training dataset.

Compared to data-driven methods, optimization-based methods do not require a large face dataset and are not limited by training data; they only require certain features extracted from the input images, such as landmarks, edges, and segmentations. In early optimization-based work, facial landmarks were used as input for 3D face reconstruction [15, 44]. Nevertheless, landmarks only provide coarse information about the face shape and are less robust under large facial pose angles, so in recent studies they are mainly used in the initialization or face-tracking stage. Bas et al. [45] employed contour edge features to refine the shape reconstruction and make it robust under various pose angles. Inspired by previous work [15, 44, 45], the present study uses landmarks detected with Dlib [46] and contour edge features [45] in an optimization-based 3D face shape reconstruction.

2.3.2 Face texture reconstruction

In face texture reconstruction, Blanz and Vetter [15] developed a 3DMM-based texture model for the BFM face model, which obtained RGB colors through optical devices and built a texture model from these data using PCA. The texture parameters were optimized by minimizing the pixel-wise distance between the rendered facial image and the segmented facial region of the input image. However, the facial texture reconstructed from a 3DMM often has low resolution and limited representation ability owing to the limited training data and the Gaussian nature of the model. Sampling texture directly from the input face image into UV space, by contrast, retains the details and individual features, yet these samples are often incomplete because of the limited view angles and occlusions in the input images. To enrich the details and individual features of the reconstructed face texture, some studies have combined sampled facial texture with 3DMM facial textures, or have introduced generative adversarial networks (GANs) for texture completion (i.e., image inpainting). For example, some studies [47, 48] combined the 3DMM texture model with textures sampled from the input image in UV space, which improved the robustness of texture reconstruction under occlusion and large pose angles while enriching the facial texture details.

With the development of deep learning, GANs [49] have shown good performance in image inpainting. Inspired by work on image completion, Deng et al. [50] proposed UV-GAN to complete UV face textures sampled from in-the-wild images. Gecer et al. [18] proposed GANFIT, which combines a GAN with differentiable rendering to reconstruct facial models with detailed UV textures. In addition, Lattas et al. [51] collected a professional facial texture dataset containing diffuse and specular albedos and normal maps paired with RGB images, and trained a deep neural network to reconstruct the texture from a single image by supervised learning. Nevertheless, these GAN-based methods are often time consuming and their training datasets are often unavailable; for example, OSTeC [52] needs around 5 min and GANFIT [18] takes more than 30 s for texture reconstruction, and collecting such a training dataset requires expensive professional equipment [51]. In the present study, individual information sampled from the input images is combined with the 3DMM texture, and differentiable rendering is used for facial texture reconstruction.

3 Method

In the present study, a 4-stage method is developed to personalize human avatars of various topologies, where Fig. 2 depicts the entire pipeline: (1) reconstruction of the facial features based on the 3D Morphable Model in stage one (see Sect. 3.1); (2) transferring the reconstructed face shape to a full-body fashion avatar in stage two (see Sect. 3.2); (3) reconstruction of the facial and body texture in stage three (see Sect. 3.3); and (4) alignment and joining of the head with the body model in stage four (see Sect. 3.4).

Fig. 2
figure 2

Full pipeline of personalized fashion avatar generation

3.1 Preprocessing

Before the reconstruction process, facial landmark detection and face segmentation are required on the input face images: the detected facial landmarks are used for shape reconstruction in stage one, and the segmented face regions are used for texture reconstruction in stage three. Moreover, most avatar models used in fashion applications lack annotated facial landmarks, so the landmarks of the 3D head models must be estimated for transferring the customized face shape to the head of the fashion avatar in stage two and for head-body alignment in stage four. The facial features and the preprocessing steps are shown in Fig. 3.

Fig. 3
figure 3

Preprocessing steps: (a) image feature extraction: given a face image, a 2D landmark detector [46] is used to estimate 2D face landmarks, and a pretrained face parsing network [54] is used to segment the face image into different regions; (b) 3D landmark estimation: a face image is rendered from the 3D head template, the landmark detector [46] estimates 2D landmarks on it, and these are back-projected onto the head mesh as 3D landmarks

3.1.1 Landmarks detection on input images

To align the face model with the facial image and initialize the face reconstruction, facial landmarks must be detected from the images; pretrained models are available for this purpose [44, 46, 53]. Because the present work requires very accurate landmark localization, Dlib [46] is used to detect 68 landmark points (see Fig. 3(a)).
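A minimal sketch of this detection step is given below; it assumes the Dlib Python bindings and the standard 68-point predictor file, whose local path is an assumption rather than part of the paper.

```python
# Sketch of 68-point landmark detection with Dlib [46]; the predictor file
# name/path is an assumption, not specified in the paper.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(image_rgb: np.ndarray) -> np.ndarray:
    """Return a 68 x 2 array of (x, y) landmark coordinates for the first detected face."""
    faces = detector(image_rgb, 1)           # upsample once to help with small faces
    shape = predictor(image_rgb, faces[0])   # fit the 68-point shape model
    return np.array([[p.x, p.y] for p in shape.parts()])
```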

3.1.2 Face segmentation on input images

For texture reconstruction, a pretrained face parsing network [54] is used to segment the input face image into facial skin and hair regions (see Fig. 3(a)). This network was trained on the CelebA-HQ dataset and achieves state-of-the-art face parsing accuracy of 93.86% [54].

3.1.3 3D landmark estimation for head model

Most 3D fashion avatars lack annotated facial features [2, 7, 8], whereas landmarks are required in subsequent processing steps, such as facial shape transfer [14] and alignment [7, 9, 29]. It is therefore necessary to estimate facial landmarks for the target fashion avatars. Considering that landmarks can be detected more accurately from 2D images than directly from mesh models, the present method first renders the 3D head model of the avatar to a 2D image, detects 2D landmarks on it using the detector [46], and then back-projects them onto the 3D mesh. Figure 3(b) shows this 3D landmark estimation on the 3D head model.
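The back-projection step can be sketched as follows, assuming a PyTorch3D mesh and rasterizer and that the template head is already positioned in front of the camera; render_image() is a hypothetical rendering helper and detect_landmarks() wraps the Dlib detector of Sect. 3.1.1. This is an illustrative sketch of the idea, not the exact implementation.

```python
# Sketch: detect 2D landmarks on a rendered view of the head template and
# back-project them onto the mesh via the rasterizer's barycentric coordinates.
import torch
from pytorch3d.renderer import FoVOrthographicCameras, MeshRasterizer, RasterizationSettings

def estimate_3d_landmarks(head_mesh, image_size=512):
    cameras = FoVOrthographicCameras()
    rasterizer = MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(image_size=image_size))
    fragments = rasterizer(head_mesh)            # per-pixel triangle ids and barycentrics

    rendered = render_image(head_mesh, cameras)  # hypothetical shading/rendering helper
    lmk_2d = detect_landmarks(rendered)          # 68 x 2 pixel coordinates (Sect. 3.1.1)

    verts, faces = head_mesh.verts_packed(), head_mesh.faces_packed()
    lmk_3d = []
    for x, y in lmk_2d.round().astype(int):
        face_id = fragments.pix_to_face[0, y, x, 0]  # -1 for background pixels (skipped in practice)
        bary = fragments.bary_coords[0, y, x, 0]     # barycentric weights inside that triangle
        lmk_3d.append((verts[faces[face_id]] * bary[:, None]).sum(0))
    return torch.stack(lmk_3d)                   # back-projected 3D landmarks on the head mesh
```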

3.2 Face shape reconstruction

3.2.1 3DMM

The 3D morphable model (3DMM) [55] forms the foundation of face reconstruction; it is a statistical face model built from 3D face scan data. By means of PCA, the shape and texture of a 3DMM face model can be constructed by linearly weighting its principal components. In this study, the Basel Face Model (BFM) [15], one of the 3DMM face models, is employed for face reconstruction. The 3D morphable facial shape, expression, and texture are formulated as:

$$S={S}_{mean}+{{\alpha }_{s}\bullet P}_{s}+{{\alpha }_{exp}\bullet P}_{exp}$$
(1)
$$T={T}_{mean}+{{\alpha }_{T}\bullet P}_{T}$$
(2)

where \({S}_{mean}\in {\mathbb{R}}^{3N}\) represents the mean shape of the 3D face model with N vertices in 3-dimensional coordinates, \({P}_{S}\) represents the principal components of shape, and \({\alpha }_{s}\) are the parameters controlling the shape of the face model. \({P}_{exp}\) represents the principal components of facial expression of the FaceWarehouse model [16], and \({\alpha }_{exp}\) are the parameters controlling the facial expression. \({T}_{mean}\in {\mathbb{R}}^{3N}\) is the mean texture model of the facial appearance in color space, \({P}_{T}\) are the principal basis vectors of the texture model, and \({\alpha }_{T}\) are the parameters of the texture model.
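For clarity, the linear model of Eqs. (1)-(2) can be sketched in a few lines of numpy; the variable names and dimensions are illustrative, assuming the mean shape/texture and principal-component matrices have been loaded from the BFM and FaceWarehouse files.

```python
# Minimal sketch of the 3DMM of Eqs. (1)-(2); P_s, P_exp, P_T are assumed to be
# (3N x k) component matrices and the alphas the corresponding coefficient vectors.
import numpy as np

def morph_face(S_mean, P_s, P_exp, T_mean, P_T, alpha_s, alpha_exp, alpha_T):
    S = S_mean + P_s @ alpha_s + P_exp @ alpha_exp   # Eq. (1): 3N vertex coordinates
    T = T_mean + P_T @ alpha_T                       # Eq. (2): 3N per-vertex colors
    return S.reshape(-1, 3), T.reshape(-1, 3)        # N x 3 vertices and N x 3 RGB values
```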

3.2.2 Personalized face reconstruction

Reconstructing 3D face models from single input facial images based on a 3DMM can be broken down into estimating facial shape, texture, and pose. In this study, the BFM is used to reconstruct 3D faces from input face images, with orthogonal projection \(P\left(\bullet \right)\) assumed as the camera model. Face reconstruction can then be considered as optimizing the parameters \(\mu\) of the 3DMM face model, including the shape parameters \({\alpha }_{s\_f}\), texture parameters \({\alpha }_{T\_f}\), and pose parameters \(\left[R, t,s\right]\) consisting of rotation, translation, and scaling factors:

$$\mu =\left[{\alpha }_{s\_f},{\alpha }_{T\_f},R, t,s\right]$$
(3)

The parameters are optimized by means of an overall loss function, which evaluates the distance between the rendered image of the predicted face model, \(I{\prime}\), and the input image, \(I\). The landmark loss \({L}_{lmk}\left(\mu \right)\), contour edge loss \({L}_{edge}\left(\mu \right)\), and photometric loss \({L}_{photo}\left(\mu \right)\) are used for face reconstruction.

The landmark loss function

Measures the differences between the landmarks detected from the input image and the landmarks projected from the estimated face model. The landmark loss plays a crucial role in predicting the pose and initial face shape, and it is defined as follows:

$${L}_{lmk}\left(\mu \right)=\frac{1}{N}\sum_{i=1}^{N}{\Vert {v}_{i}^{lmk}-P\left({v}_{i}^{{\prime}lmk}\right)\Vert }^{2}$$
(4)

where N is the number of facial landmarks, \({v}_{i}^{lmk}\) are the landmarks detected from the input image \(I\), and \({v}_{i}^{{\prime}lmk}\) are the projected landmarks of the estimated face model.

The contour edge loss function

Is used to fine-tune the reconstructed face shape, since landmarks can only provide rough shape information [45]. It measures the distances between the points on the contours detected from the 2D image, and their nearest points on the occluding contours projected from the predicted 3D facial shape. The contour loss function is as follows:

$${L}_{edge}\left(\mu \right)=\frac{1}{{N}_{C}}\sum_{j\in C}{\Vert {v}_{j}^{edge}-P\left({v}_{j}^{{\prime}cont}\right)\Vert }^{2}$$
(5)

where C is the set of paired points, computed with a K-nearest-neighbor algorithm, consisting of points on the edges \({v}_{j}^{edge}\) of the input face image and the corresponding points on the occluding contours \(P\left({v}_{j}^{{\prime}cont}\right)\) of the projected 3D face, and \({N}_{C}\) is the number of such pairs.

The photometric loss function

Measures the pixel-wise distance between the face regions of the input image \(I\) and the rendered image \({I}{\prime}\) of the 3D face. The photometric loss function is as follows:

$$L_{photo}\left(\mu \right)=\frac{1}{{N}_{R}}\sum_{p\in R}\Vert {I}_{p}-{I}_{p}{\prime}\Vert$$
(6)

where \(p\) is a pixel on the corresponding face region \(R\), which is segmented by [54] and consists of \({N}_{R}\) pixels.

Personalized face shape models are reconstructed from single images by optimizing the total loss as follows:

$$L\left(\mu \right)={\omega }_{lmk}{L}_{lmk}\left(\mu \right)+{\omega }_{edge}{L}_{edge}\left(\mu \right)+{\omega }_{photo}{L}_{photo}\left(\mu \right)$$
(7)

where \({\omega }_{lmk}\), \({\omega }_{edge}\) and \({\omega }_{photo}\) are corresponding weights for the landmark loss \({L}_{lmk}\), contour edge loss \({L}_{edge}\), and photometric loss \({L}_{photo}\).
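The overall fitting of Eq. (7) can be sketched as a gradient-based optimization loop. The helpers build_face(), render(), project_landmarks(), and occluding_contours() are hypothetical placeholders for the 3DMM evaluation (Eqs. (1)-(2)), the differentiable renderer (PyTorch3D in this work), the orthogonal projection, and the edge-pairing step; the parameter dimensions and loss weights are illustrative only.

```python
# Sketch of a fitting loop minimizing Eq. (7) with Adam; not the exact implementation.
import torch

def fit_face(image, lmk_2d, face_mask, n_iters=200, w_lmk=1.0, w_edge=0.3, w_photo=1.0):
    alpha_s = torch.zeros(80, requires_grad=True)    # shape parameters (dimension illustrative)
    alpha_exp = torch.zeros(29, requires_grad=True)  # expression parameters
    alpha_T = torch.zeros(80, requires_grad=True)    # texture parameters
    pose = torch.zeros(6, requires_grad=True)        # rotation (axis-angle) + translation
    scale = torch.ones(1, requires_grad=True)

    opt = torch.optim.Adam([alpha_s, alpha_exp, alpha_T, pose, scale], lr=1e-2)
    for _ in range(n_iters):
        verts, colors = build_face(alpha_s, alpha_exp, alpha_T)   # Eqs. (1)-(2)
        rendered = render(verts, colors, pose, scale)             # differentiable rendering

        proj_lmk = project_landmarks(verts, pose, scale)          # Eq. (4)
        loss_lmk = ((lmk_2d - proj_lmk) ** 2).sum(-1).mean()

        edge_pts, cont_pts = occluding_contours(image, verts, pose, scale)  # Eq. (5)
        loss_edge = ((edge_pts - cont_pts) ** 2).sum(-1).mean()

        # Eq. (6): pixel-wise distance restricted to the segmented face region
        loss_photo = (torch.norm(image - rendered, dim=-1) * face_mask).sum() / face_mask.sum()

        loss = w_lmk * loss_lmk + w_edge * loss_edge + w_photo * loss_photo  # Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return alpha_s, alpha_exp, alpha_T, pose, scale
```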

3.3 Head shape reconstruction by shape transfer and full head alignment

Although the use of personalized avatars is becoming increasingly important in the fashion industry, most existing platforms and applications only provide human avatars with body shape reconstruction and lack facial shape and appearance reconstruction [1, 2, 5,6,7,8,9]. Although 3DMM-based face reconstruction is a common approach, it cannot be applied directly in this case because avatar topologies differ across platforms. This paper presents a method that transfers face shapes, reconstructed from 3DMM models, onto the full-head models of any pre-existing human avatars. It provides greater flexibility and convenience, as the method can customize or personalize the faces of human avatars used in existing applications and platforms.

3.3.1 Transferring face shapes to head models

To transfer the BFM-based parametric shape \({S}_{f}\) to a full-head model \({S}_{h}\) with a different topology, PCA parameters and regression are used to establish correspondence between the head model of the fashion avatar and the selected 3DMM face (BFM) model. The process begins by randomizing the face shape parameters of \({S}_{f}\) and generating paired synthetic head shapes with the nonrigid iterative closest point (NICP) registration algorithm [56], using the aligned 3D face landmarks obtained in Sect. 3.1.3 as constraints. Next, PCA is used to generate the principal components of the head model.

The head and face 3D morphable models can be represented as follows,

$${S}_{h}={S}_{mean\_h}+{\alpha }_{s\_h}\bullet {P}_{s\_h}$$
(8)
$${S}_{f}={S}_{mean\_f}+{{\alpha }_{s\_f}\bullet P}_{s\_f}$$
(9)

where \({S}_{mean\_h}\) and \({S}_{mean\_f}\) are the mean shapes of the head and face models, respectively; \({\alpha }_{s\_h}\) and \({\alpha }_{s\_f}\) are the parameters of the head and face principal components \({P}_{s\_h}\) and \({P}_{s\_f}\).

Transferring the face shape \({S}_{f}\) to the head model \({S}_{h}\) can then be considered as a regression problem. After shape registration of the face and head models, the head and face shape parameters are represented as matrices \({C}_{h}\in {\mathbb{R}}^{{n}_{h}\times r}\) and \({C}_{f}\in {\mathbb{R}}^{{n}_{f}\times r}\), respectively. The face-to-head transformation is solved by a regression matrix [29] as follows:

$${T}_{h,f}={C}_{h}{C}_{f}^{T}{\left({C}_{f}{C}_{f}^{T}\right)}^{-1}$$
(10)

where \({C}_{f}^{T}{\left({C}_{f}{C}_{f}^{T}\right)}^{-1}\) is the pseudo-inverse of \({C}_{f}\). Once the regression matrix is obtained, the principal components of the head model can be calculated as:

$${P}_{s\_h}={T}_{h,f}{P}_{s\_f}$$
(11)
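In code, Eqs. (10)-(11) reduce to a closed-form least-squares regression; a minimal numpy sketch is given below, assuming the paired shape matrices \({C}_{h}\) and \({C}_{f}\) have been built from the NICP-registered synthetic head/face pairs described above.

```python
# Sketch of the face-to-head parameter transfer in Eqs. (10)-(11).
import numpy as np

def face_to_head_regression(C_h: np.ndarray, C_f: np.ndarray) -> np.ndarray:
    # Eq. (10): T_{h,f} = C_h C_f^T (C_f C_f^T)^{-1}; when C_f has full row rank this
    # equals C_h @ np.linalg.pinv(C_f), which is numerically safer in practice.
    return C_h @ C_f.T @ np.linalg.inv(C_f @ C_f.T)

def transfer_principal_components(T_hf: np.ndarray, P_s_f: np.ndarray) -> np.ndarray:
    # Eq. (11): map the face principal components onto the head model
    return T_hf @ P_s_f
```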

3.3.2 Complete full head reconstruction

In most 3D full-head models, the eyeballs and hair are separate 3D meshes in addition to the head shape model (as shown in Fig. 4); these meshes must be transformed to match the customized head shape. To do so, the head joint is taken as the origin, and the center coordinates \({v}_{c\left\{ \bullet \right\}}\) of the eyeball and hair models are transferred to the local coordinates of the head model. When adjusting the size of these models, the scale factors \({H}_{x,y,z\left\{\text{eyes},\text{hair}\right\}}\) are calculated proportionally along the three axes, according to the reference points of the head model \({v}_{i,j}\in \varphi\) and those of the template \({v}_{{i}_{0},{j}_{0}}\in {\varphi }_{0}\), as follows:

Fig. 4
figure 4

Full head reconstruction and transfer. Above: face texture reconstruction and full head texture transfer. Below: full head shape transfer

$${H}_{x,y,z\left\{\text{eyes},\text{hair}\right\}}=\Vert \frac{\underset{i,j\in {\phi }_{\left\{\text{eyes},\text{hair }\right\}}}{\text{max}}{v}_{i\left(x,y,z\right)}-{v}_{j\left(x,y,z\right)}}{\underset{{i}_{0},{j}_{0}\in {\phi }_{0 \left\{\text{eyes},\text{hair }\right\}}}{\text{max}}{v}_{{i}_{0}\left(x,y,z\right)}-{v}_{{j}_{0}\left(x,y,z\right)}}\Vert$$
(12)

Landmarks of the eyes \({\varphi }_{\left\{\text{eyes}\right\}}\) and vertices on the bounding box of the cranium \({\varphi }_{\left\{\text{hair}\right\}}\) are used as reference points on the corresponding templates.

To adjust the model location, the corresponding landmarks as well as the center of the bounding box of the hair model are used to compute the translation and rotation matrix.

$${t}_{\left\{\text{ eyes},\text{hair }\right\}}={\left[{v}_{c\left\{\text{ eyes},\text{hair }\right\}}-{v}_{0c\left\{\text{ eyes},\text{hair }\right\}}\right]}^{T}$$
(13)

The transformation of these models is calculated as follows:

$${M}_{\left\{\text{eyes},\text{hair}\right\}}=\left(\begin{array}{ccc}{H}_{x}& 0& 0\\ 0& {H}_{y}& 0\\ 0& 0& {H}_{z}\end{array}\right)R\left(\begin{array}{c}{v}_{x}\\ {v}_{y}\\ {v}_{z}\end{array}\right)+{t}_{\left\{\text{eyes},\text{hair}\right\}}$$
(14)
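A compact numpy sketch of Eqs. (12)-(14) is given below; it assumes the reference points are supplied as (n x 3) vertex arrays for the personalized head and the template, and approximates the per-axis extent in Eq. (12) by the difference between the maximum and minimum coordinates.

```python
# Sketch of the eyeball/hair scaling, rotation, and translation of Eqs. (12)-(14).
import numpy as np

def scale_factors(ref_pts: np.ndarray, ref_pts_template: np.ndarray) -> np.ndarray:
    # Eq. (12): per-axis extent of the personalized reference points divided by
    # the extent of the same points on the template
    return np.abs((ref_pts.max(0) - ref_pts.min(0)) /
                  (ref_pts_template.max(0) - ref_pts_template.min(0)))

def transform_part(verts: np.ndarray, H: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    # Eq. (14): scale, rotate, then translate the eyeball/hair mesh vertices
    return (np.diag(H) @ R @ verts.T).T + t
```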

3.4 Texture reconstruction

This section describes a method to reconstruct an authentic customized appearance for any pre-existing fashion avatar, which often has only a default texture map with limited personalization ability. For a complete personalized appearance, both the facial appearance and the full-body texture are personalized (see Fig. 5).

Fig. 5
figure 5

Left: an avatar without texture; center: an avatar without personalized texture, and right: a personalized avatar

3.4.1 Face texture reconstruction

Parametric texture models have low resolution and limited personalized representation ability owing to the limited training data and their Gaussian nature. To reconstruct a fully personalized face appearance, the facial texture is first sampled from the input face image (as described in Sect. 3.1.2) and then combined with a reconstructed coarse albedo map in UV space. First, the PCA albedo map is used to obtain facial color features by optimizing the BFM facial texture model through Eq. (2). Second, the head mesh is unwrapped into UV space by cylindrical unwrapping, resulting in a full-head UV map. Third, the reconstructed texture obtained in the first step is rasterized onto the UV map. Fourth, using the face mask obtained from face segmentation in the preprocessing step (Sect. 3.1.2), the facial texture is sampled from the input image and mapped onto the full-head UV map based on the correspondence established between the reconstructed face and the full-head shape (Sect. 3.3.1). Fifth, the sampled texture is fused with the reconstructed texture through additive fusion and blurred with a Gaussian kernel of 30-pixel radius, as shown in Fig. 4.
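One possible reading of the fusion step is sketched below, assuming the sampled UV texture, the coarse PCA albedo map, and the binary face mask are aligned H x W(x3) arrays; blurring the mask boundary before the additive blend is an assumption about the implementation detail, with the 30-pixel kernel taken from the paper.

```python
# Sketch of the additive texture fusion with a Gaussian-softened face mask (Sect. 3.4.1).
import cv2
import numpy as np

def fuse_face_texture(sampled_uv, coarse_uv, face_mask, kernel_radius=30):
    k = 2 * kernel_radius + 1                                   # odd kernel size for OpenCV
    soft_mask = cv2.GaussianBlur(face_mask.astype(np.float32), (k, k), 0)[..., None]
    # inside the face region keep the sampled detail, elsewhere keep the coarse albedo
    return soft_mask * sampled_uv + (1.0 - soft_mask) * coarse_uv
```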

3.4.2 Reconstruction of other textures

This paper also proposes a method to customize the texture of the body (or hair) using color features extracted from the facial skin (or hair) region as a reference. The method extracts color features from the face image and the body texture and transfers them in the CIELAB (Lab) color space [57]. The color features of the facial skin, defined as the target features, and the color features of the body texture, defined as the source features, are represented in Lab space by their mean values and standard deviations: \(\left({\overline{l} }_{t},{\overline{a} }_{t},{\overline{b} }_{t}\right)\) and \(\left({\sigma }_{t}^{l},{\sigma }_{t}^{a},{\sigma }_{t}^{b}\right)\) denote the mean and standard deviation of the target (facial skin) color, \(\left({\overline{l} }_{s},{\overline{a} }_{s},{\overline{b} }_{s}\right)\) and \(\left({\sigma }_{s}^{l},{\sigma }_{s}^{a},{\sigma }_{s}^{b}\right)\) denote those of the source (body texture) color, and \(\left({l}_{s},{a}_{s},{b}_{s}\right)\) are the per-pixel source values. These features are used to generate customized body (or hair) textures with Eq. (15); the process is illustrated in Fig. 6, and a sketch is given after the equation.

Fig. 6
figure 6

Texture reconstruction

$$l=\frac{{\sigma }_{t}^{l}}{{\sigma }_{s}^{l}}\left({l}_{s}-{\overline{l} }_{s}\right)+{\overline{l} }_{t},\quad a=\frac{{\sigma }_{t}^{a}}{{\sigma }_{s}^{a}}\left({a}_{s}-{\overline{a} }_{s}\right)+{\overline{a} }_{t},\quad b=\frac{{\sigma }_{t}^{b}}{{\sigma }_{s}^{b}}\left({b}_{s}-{\overline{b} }_{s}\right)+{\overline{b} }_{t}$$
(15)
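The sketch below illustrates Eq. (15) with scikit-image for the RGB-Lab conversions, assuming float RGB inputs in [0, 1]; the facial-skin pixels come from the segmentation of Sect. 3.1.2 and the source pixels from the avatar's default body texture.

```python
# Sketch of the Lab-space color transfer of Eq. (15): match the per-channel mean
# and standard deviation of the body texture to those of the facial skin.
import numpy as np
from skimage import color

def transfer_color(body_texture_rgb: np.ndarray, face_skin_pixels_rgb: np.ndarray) -> np.ndarray:
    src = color.rgb2lab(body_texture_rgb)                      # source: H x W x 3 body texture
    tgt = color.rgb2lab(face_skin_pixels_rgb[np.newaxis])      # target: 1 x M x 3 skin pixels

    src_mean, src_std = src.reshape(-1, 3).mean(0), src.reshape(-1, 3).std(0)
    tgt_mean, tgt_std = tgt.reshape(-1, 3).mean(0), tgt.reshape(-1, 3).std(0)

    out = (src - src_mean) * (tgt_std / src_std) + tgt_mean    # Eq. (15), channel-wise
    return color.lab2rgb(out)
```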

3.5 Full-body human avatar reconstruction

Previous research on reconstructing personalized fashion avatars replaced the face of a full-body avatar with a 3DMM face model by cutting and joining [7]. Unfortunately, this technique leads to depth compression and distortion of the face model owing to the deterioration of depth information and inaccurate landmarks. Moreover, the texture of the mesh model along the connecting regions is neither smooth nor consistent.

Instead, in the present work the head of the human avatar, reconstructed using the method detailed in Sect. 3.3, is first cut along the neck plane and then aligned with the body by a rigid transformation. The personalized head is then joined to the body model along the boundary; see Fig. 7 for different head poses.
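The rigid alignment can be sketched as a standard Kabsch/Umeyama fit between matched 3D landmark sets on the reconstructed head and on the body template; the sketch below is an assumption about how such an alignment is typically computed, not the paper's exact procedure.

```python
# Sketch of a similarity (scale + rotation + translation) fit between matched point sets.
import numpy as np

def rigid_align(src_pts: np.ndarray, dst_pts: np.ndarray):
    src_c, dst_c = src_pts.mean(0), dst_pts.mean(0)
    H = (src_pts - src_c).T @ (dst_pts - dst_c)           # 3 x 3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                              # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    s = np.trace(R @ H) / ((src_pts - src_c) ** 2).sum()  # optimal isotropic scale
    t = dst_c - s * R @ src_c
    return s, R, t                                        # apply to a point p as: s * (R @ p) + t
```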

Fig. 7
figure 7

Head joining in different pose

4 Experimental results

This section reports the comprehensive experiments conducted to evaluate the proposed method. First, a comparative analysis was conducted and reported in Sect. 4.1, evaluating the proposed method in comparison with other state-of-the-art methods for face reconstruction. Next, a qualitative analysis, in terms of visual comparison, between the current method and other full-body reconstruction methods is provided in Sect. 4.2, demonstrating the advantage of current topology-free approach for avatar personalization. Lastly, in Sect. 4.3, the impacts of facial landmarks, input image sizes, face pose angles, and fusion methods are further evaluated and discussed.

4.1 Face only comparison

To demonstrate that the proposed method can personalize avatars, in terms of facial appearance, from any in-the-wild face images, experiments were first conducted to evaluate the effectiveness and robustness of the face reconstruction. As outlined in the method section, face reconstruction covers both 3D face shape and facial texture. There is no available dataset of paired in-the-wild face images and 3D face models that could be used to evaluate face shape reconstruction, and most shape evaluations have been based on synthetic face models [32, 45, 55]. In this paper, the evaluation of face shape is therefore integrated with texture and appearance, using perceptual similarity as the metric. In short, different in-the-wild face image datasets were selected and subsampled, face models covering both shape and texture were reconstructed with the proposed method and with other state-of-the-art methods, face images were rendered from these customized face models, and the rendered images were compared with the ground truth images, i.e., the input face images used for reconstruction, to evaluate the perceptual similarity score.

4.1.1 Datasets and metric

Image datasets

Three face image datasets were selected to assess the proposed method against other SOTA face reconstruction methods; each dataset has unique characteristics and represents a different type of image input. The CelebA dataset [58], which is widely used to evaluate face reconstruction, contains 10,177 identities and 202,599 in-the-wild face images; a subset of 30 face images, representing different identities, was randomly sampled from it. The FFHQ dataset [59], a dataset of 60,000 high-quality human face images at different resolutions, was used to evaluate the effect of input image quality; face images of 30 identities were randomly sampled, each at resolutions of 256 × 256, 512 × 512, and \(1024\times 1024\) pixels, giving 90 input face images in total. The Facescape dataset [36] has 8,277 individual images captured from various angles and with different facial expressions and poses, providing a diverse view of human facial features; it was used to evaluate the influence of the view angle of the input images, with a subset of 105 images randomly sampled across 7 view angles, 15 images per angle.

Metric

Face perceptual similarity is a widely used metric in face reconstruction studies [18, 60, 61]. It measures the cosine similarity of embedding features extracted from face images by a pretrained network, VGGFace [62], which was trained on a dataset of 2.6 million face images of 2,622 individuals for face recognition and achieves an average accuracy of 97%, comparable to humans.

For the evaluation of face reconstruction, the cosine distance between these feature embeddings is measured as follows:

$$D_{perc}=1-\frac{F\left(I\right)\bullet F\left(I{\prime}\right)}{\Vert F\left(I\right)\Vert \times \Vert F\left(I{\prime}\right)\Vert }$$
(16)

where \(F\left(\bullet \right)\) denotes the VGGFace feature embedding, \(I\) is the ground truth image (i.e., the input face image used for reconstruction), and \(I{\prime}\) is the rendered image of the generated face. According to Eq. (16), the smaller the cosine distance, the higher the perceptual similarity between the input face image and the rendered image of the reconstructed face. Since the metric measures similarity at the image level, it gives an overall evaluation covering both shape and texture appearance.
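A minimal sketch of this metric is shown below; embed() is a hypothetical wrapper that returns the VGGFace [62] feature vector of a face image.

```python
# Sketch of the perceptual (cosine) distance of Eq. (16).
import numpy as np

def perceptual_distance(img_gt, img_rendered) -> float:
    f1, f2 = embed(img_gt), embed(img_rendered)       # VGGFace feature embeddings
    cos_sim = float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))
    return 1.0 - cos_sim                              # smaller distance = higher similarity
```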

4.1.2 Template models and implementation details

For face reconstruction and avatar personalization, the BFM [15] face model and full-body human template models from the commercial software CLO 3D [63] were chosen for the experiments. The BFM face model consists of 35,709 vertices and was built from 3D scans of 100 males and 100 females. The CLO 3D human templates have different topologies, and their face regions also differ in topology from the BFM. The male and female templates contain 40,992 and 40,763 vertices, respectively, and their head regions contain 7,814 and 8,187 vertices, respectively.

The experiments were conducted on a PC with an Intel i7 CPU and an NVIDIA Titan GPU in a Python environment, with the differentiable renderer of PyTorch3D used to render the reconstructed models.

4.1.3 Comparative quantitative analysis

The average face perceptual similarity score and the standard deviation were calculated for each evaluation set of face images from the three datasets, and the results are shown in Table 1, which compares the proposed method with other state-of-the-art face reconstruction methods, including BFM [15], DECA [61], and weak-rigid registration [7]. The DECA method [61] realizes full-head reconstruction based on the TF-FLAME morphable head model [17] and employs the texture model of the BFM. The weak-rigid registration method [7] fuses the BFM results with a coarse head model of SMPL [9].

Table 1 Comparison of perceptual similarity between ground truth and results of various methods: mean cosine distance (± standard deviation)

Table 1 shows that, among all the listed methods (rows 2 to 4), the present method (row 5, results in bold) achieves the smallest mean cosine distance between the reconstructed face images and the input images on all three datasets.

Moreover, paired-sample t-tests were conducted to assess whether there are significant differences between the proposed method and the other methods in terms of reconstruction quality. The test results are shown in Fig. 8.

Fig. 8
figure 8

Comparisons of faces reconstructed by (a) BFM method; (b) weak-rigid registration method; (c) DECA method; (d) the proposed method using a BFM head model; (e) the proposed method using a TF-Flame head model; (f) the proposed method using a SMPL head model. There are significant differences between results of (d) and (a) at \(p\le 0.001\), (d) and (b) at \(p\le 0.001\), (d) and (c) at \(p\le 0.05\), between (e) and (a) at \(p\le 0.001\), (e) and (b) at \(p\le 0.001\), (e) and (c) at \(p\le 0.001\), and between (f) and (a) at \(p\le 0.001\), (f) and (b) at \(p\le 0.001\), (f) and (c) at \(p\le 0.05\), while there is no significance difference among (d), (e), and (f)

4.1.4 Visual comparison of reconstructed faces

The above quantitative comparison shows that the proposed method performs significantly better than other face reconstruction methods across different types of input face images. In addition to the statistical evaluation, a qualitative comparison is given here in terms of visual inspection; some sampled face reconstruction results are presented in Fig. 9.

Fig. 9
figure 9

Comparison of different face reconstruction methods: (a) input images, outputs from (b) BFM [15], (c) DECA [61], (d) Weak-rigid registration [7] and (e) the present method

It is apparent from Fig. 9 that the texture of the BFM [15] method has relatively little detail, for example in the eyes, and a lower resolution, owing to the nature of PCA-based texture models. This can lead to a low individual identity of the facial representations.

In column (c) of Fig. 9, the DECA [61] method achieves a full-head reconstruction, including the head shape. Nevertheless, the facial appearance of the DECA results lacks detail and does not preserve the identity of the input images well. This is because DECA utilizes the BFM texture model as its facial appearance model, and the BFM texture model, trained on 100 males and 100 females, can only generate limited appearance diversity.

The reconstructions obtained with the weak-rigid registration algorithm [7] are shown in column (d) of Fig. 9. They show some inconsistency between the face and head regions, caused by fusing the BFM results with a coarse SMPL head model [9] and by distortion of the face model during registration. Moreover, since the texture is based on the BFM model, it has essentially the same limitations as the BFM.

The reconstructions of the present method are shown in column (e) of Fig. 9, from which it is apparent that they preserve individual identity better than BFM and that the connection between face and head is smoother than that of weak-rigid registration [7]. These qualitative results show that the present method can reconstruct good facial shape and appearance, making it highly suitable for fashion avatar personalization.

4.2 More results on avatar personalization by full-head customization

4.2.1 Comparison of personalized avatars

A comparative evaluation of personalized avatars produced by different methods was undertaken, including SMPL-X [11], weak-rigid registration [7] applied to the simple full-body SMPL model [9], and the proposed method; the two selected baselines are the most relevant to avatar personalization. Figure 10 shows qualitative results of personalized avatars generated by the different methods from the same input images. As shown, the SMPL-X method [11] can reconstruct a human model from a single image but with limited facial detail and without full-body texture. The weak-rigid registration method [7] can reconstruct a personalized human model, but the face is distorted and the skin texture is inconsistent between face and body. In contrast, the present method personalizes the human avatar with greater detail and with consistent skin texture for face and body. More examples are given in Fig. 13 and in Fig. 1, demonstrating that the present method can create avatars with different skin colors and facial shapes.

Fig. 10
figure 10

Comparison of various full-body reconstruction methods: (a) input image, (b) SMPL-X [11], (c) weak-rigid registration [7] and (d) the present method

4.2.2 Avatars with head models of different topologies

To demonstrate that the present method can be applied to body or head models of arbitrary topology, the experiments were repeated on the same input face image datasets with different head models, including the head portions of the full-body models SMPL [9] and SMPL-X [11]. The TF-Flame full head [17], used in the full-body model SMPL-X [11], has a more complex structure with independent eye, hair, and head mesh components, while SMPL [9] has a simple structure with a single head mesh. The results are given in the 6th and 7th rows of Table 1, directly below the present method's BFM-based results in the 5th row. Compared with the high-resolution BFM model, which has 35,709 vertices for the face region alone, SMPL is a sparse model whose head region has only 1,324 vertices, and the TF-Flame head model contains 5,023 vertices. The reconstruction pipeline on SMPL focuses only on head shape and texture reconstruction, with an average computing time of around 45.10 s; the TF-Flame model contains relatively more detail than SMPL, and its reconstruction pipeline is the same as the present method, requiring around 65.50 s on average. Table 1 shows that similar reconstruction performance is achieved when the 3DMM is changed from the high-resolution BFM to SMPL [9] or TF-Flame [17]. Moreover, the paired-sample t-tests in Fig. 8 show that the reconstruction results of the present method based on SMPL [9] or TF-Flame [17] are significantly better than those of the other state-of-the-art methods [7, 15, 61]. Figure 11 compares qualitative reconstruction results on the FFHQ [59] dataset for template head models of different topologies. It is evident from both the qualitative and quantitative evaluations that the reconstructions based on different head models, even sparse ones, can still accurately capture individual identities and exhibit high reconstruction quality. Nevertheless, for extremely sparse models the performance deteriorates, as illustrated in Fig. 12.

Fig. 11
figure 11

Applying the reconstruction method to different head templates: (a) input images, (b) head from TF-Flame [17] and (c) head from SMPL [9]

Fig. 12
figure 12

Different resolutions of head models: left – head mesh contains 1324 vertices, right – head mesh contains 5023 vertices

4.2.3 Application of personalized 3D avatars for virtual try-on

This section demonstrates the use of personalized avatars in a virtual try-on fashion application. To showcase the application, high-quality full-body images were purchased and downloaded from the website ‘Motion Array’ [64], and face images were cropped from them to personalize the avatars. Figure 13 shows the try-on results of the same-sized clothing worn by personalized male and female avatars reconstructed from these images using the commercial software CLO 3D. People with varying heights, skin colors, and appearances can be customized to display different try-on effects. This technology has been applied in a mobile application for fashion try-on, and a video is provided in the supplementary material.

Fig. 13
figure 13

Personalized avatars in try-on application

4.3 Discussions of different reconstruction options

This section discusses the importance of landmark detection, one of the image preprocessing steps, and also assesses the impacts of the input face image size, face angle, and the fusion method used for texture reconstruction.

4.3.1 Landmark detection

Landmark detection plays a crucial role in the current study because the detected landmarks are used in both face and head reconstruction as well as in facial texture alignment. It is therefore interesting to investigate how the accuracy and robustness of landmark detection affect the final avatar customization. This section compares three mainstream landmark detectors, Fan et al. [44] (covering both detected 2D landmarks, denoted FAN 2D, and 3D landmarks, denoted FAN 3D), Dlib [46], and PIPNet [65]; the corresponding reconstruction results are given in Fig. 14. As shown in Fig. 14, landmark errors can cause misalignment on the reconstructed head model, with the eye regions of the rendered image located in the wrong places; the FAN detectors [44] exhibit slight localization errors, leading to misalignment in the reconstructed UV texture. To quantify the misalignment, the MSE between feature points on the rendered textured face and those projected from annotated points on the 3D head mesh was calculated; the MSEs of PIPNet, Dlib, FAN 2D, and FAN 3D are 0.95, 1.83, 2.29, and 3.08, respectively. Comparatively, Dlib [46] and PIPNet [65] predict precise landmarks for the eyes, giving correct and well-aligned facial texture. In summary, a moderate level of accuracy and robustness in facial landmarks is sufficient for face shape reconstruction, whereas a high level of accuracy is required for facial texture alignment.

Fig. 14
figure 14

Reconstruction results based on different landmark detection methods. (a) landmarks detected from face image by PiPNet [65], Dlib [46], FAN2D and FAN3D models [44]; (b) local details of landmarks on eyes region; (c) reconstructions based on the corresponding landmarks showing both head and eye meshes; (d) reconstructions results showing head models only where errors in landmark detections affect texture alignment

4.3.2 Input image size

High-resolution input images provide more information on facial appearance but also consume more storage and computing resources. It is therefore necessary to examine the effect of image size on the reconstructed appearance and resource utilization, so as to suggest the resolution that best balances processing cost and output quality. The reconstructed appearance is evaluated by the perceptual similarity distance \({D}_{perc}\) defined in Eq. (16). Table 2 compares the output quality, in terms of the mean and standard deviation (MSD) of the perceptual similarity distance, as well as the processing time, for different input image sizes. The results show that the perceptual similarity distance decreases, i.e., the output appearance improves, as the image size increases, although the computing time also increases and the differences among these resolutions are not statistically significant (Fig. 15). Hence, input images of \(512\times 512\) pixels are recommended, as this size provides good perceptual similarity for the reconstruction while taking computational efficiency into account.

Table 2 Comparison of different input image resolutions and the corresponding reconstructed appearance quality and processing time
Fig. 15
figure 15

Comparisons of difference in perceptual similarity among different resolutions of input images: no significance at \(p\ge 0.05\)

4.3.3 Input face angles

The accuracy of face reconstruction can be affected by the face angle of the input image, because frontal images contain more facial appearance information than profile images, and facial landmark detectors are generally more robust for frontal poses [44]. The effect of face angle was evaluated on the Facescape multi-view image dataset [36], selecting images from five yaw-angle ranges \(\left(\left[-50,-30\right],\left[-30,-10\right],\left[-10,10\right],\left[10,30\right],\left[30,50\right]\right)\), cropped and resized to \(512\times 512\) pixels, with the mean and standard deviation (MSD) of the perceptual similarity distance \({D}_{perc}\) used as the criterion. The resulting MSD values for the five angle ranges are \(0.40\pm 0.05\), \(0.37\pm 0.08\), \(0.31\pm 0.05\), \(0.34\pm 0.06\), and \(0.42\pm 0.07\), respectively, showing that large face angles decrease the accuracy of face reconstruction. Figure 16 shows that the differences in reconstruction quality among the face angle ranges \(\pm \left[0,10\right]\), \(\pm \left[10,30\right]\), and \(\pm \left[30,50\right]\) are statistically significant.

Fig. 16
figure 16

Comparisons of difference in perceptual similarity among different face angle ranges: significantly different between \(\pm \left[0,10\right]\) and \(\pm \left[10,30\right]\) at \(p\le 0.005\), between \(\pm \left[0,10\right]\) and \(\pm \left[30,50\right]\) at \(p\le 0.001\), and between \(\pm \left[10,30\right]\) and \(\pm \left[30,50\right]\) at \(p\le 0.05\)

4.3.4 Fusion methods

Experiments were conducted to evaluate different fusion kernels and fusion methods for face texture generation. Kernel sizes ranging from 10 to 30 pixels were tested, and 30 pixels was selected as the final fusion kernel, as shown in Fig. 17. For the fusion method, both Poisson fusion and additive fusion were tested, and Fig. 17 shows that additive fusion is more suitable for generating facial texture: with Poisson fusion, hair bangs and other occlusions decrease the segmentation accuracy. On the basis of both robustness and quality, additive fusion was selected for the final reconstruction.

Fig. 17
figure 17

Comparative ablation study of fusion methods

5 Conclusions and future work

Personalized 3D human avatars have aroused a great deal of interest because it is attractive to most people, particularly Generation Z, to have digital twins in their own appearance to live, work, interact, and shop in the metaverse. Recently, image-based reconstruction methods have been reported for personalizing human models, but they mainly focus on body shape modelling, with the facial part either neglected [2, 8,9,10] or only coarsely reconstructed [11].

This study introduces a method to transfer reconstructed facial shapes and appearances to full-body avatars with diverse topologies for various fashion applications. The method reconstructs accurate face shape and high-resolution facial appearance. Compared with methods that replace the face mesh by rigid registration, the new method not only preserves an individual-identity facial appearance but is also consistent in texture and shape.

Nevertheless, this method has a few limitations. First, the reconstruction of hair style is not covered. Second, the method can currently only reconstruct a static model, which cannot be animated to present dynamic expressions. Furthermore, its performance is adversely affected if the avatar has a head mesh with very few vertices, and the realism of the reconstructed facial appearance deteriorates in the presence of strong shading or makeup in the input face images. As for future work, it is planned to improve the appearance model by separating texture and illumination from the input images and by personalizing the shapes and colors of the hair models. It is also planned to extend the method to animated avatars, for example by introducing a facial joint skeleton or blend shapes. Lastly, future work will also include reconstructing face models from video input and improving face shape accuracy using multi-view images for real-time reconstruction.