1 Introduction

Fig. 1 Our method is able to modify the input face based on the reference hairstyle and hair color with fine hair detail and natural blending effects

In both real-life and virtual social settings, the demand for hairdressing is ubiquitous. For example, when a person finds a favorite hairstyle and hair color but is unsure whether it suits his or her appearance, he or she may resort to a virtual hair design and recommendation system, which requires only a frontal facial photograph as input and lets the user select a suitable hairstyle and color from a set of generated portraits.

In recent years, many deep learning-based methods have explored hairstyle editing tasks. Some methods use sketches [27] or masks [11, 18, 20, 30] as constraints or references to transfer a hairstyle from one image to another. However, the regenerated hair may not naturally fit the face or background, because the hairstyle or color is strictly migrated without proper adaptation. Text-based methods [16, 24], which extract features from both text and images using a pre-trained Contrastive Language-Image Pre-training (CLIP) [17] encoder, are usually user-friendly, but precise control over color and shape is hard to achieve due to the coarse-grained features obtained from text descriptions. In general, most existing methods perform well on hair attribute editing, but achieving personalized hair modification while keeping irrelevant facial features untouched remains a challenging and important task.

A user-provided hair mask serves as the hairstyle modification constraint, while another face image with the desired hair color serves as the hair color modification constraint. To achieve precise hair editing, hair attributes are decoupled into hairstyle and hair color in the latent space. The feature vectors are divided into three hierarchical levels, namely coarse, medium, and fine, as proposed by StyleCLIP [16]. The features of the reference hairstyle and hair color are fused into the latent vectors so that the modified hair better fits the face shape, the background, and the other facial regions. More specifically, we construct three encoders to extract features from the multiple input images and combine the different levels of features with the proposed modulation modules. By feeding the resulting feature vector into a pre-trained StyleGAN [9], a new face image with the expected hair attributes is generated. Some editing examples are provided in Fig. 1.

Fig. 2 The proposed framework. Our framework consists of the feature extraction module, feature fusion module, and generation module. The features of the original face image, hairstyle mask, and hair color guide image are extracted by e4e-based encoders. The extracted features are divided into three layers and combined by the modulation modules according to their granularities to form a latent space feature vector of the modified face. By employing a pre-trained StyleGAN generator, a face with expected hair attributes is generated

The main contributions of this work are summarized as follows:

  • We construct a hair editing framework consisting of three feature encoders, three latent space feature vector modulation modules, and a pre-trained StyleGAN generator. The modification is focused on hairstyle and hair color without affecting other facial attributes, and the desired hairstyle and hair color can be quickly transferred from reference images to the target face, either jointly or individually.

  • Background loss, hair color loss, normalization loss, identity loss, and hairstyle constraint loss functions are incorporated to keep a balance between the preservation of facial attributes and the modification of hair attributes, ensuring precise editing within the hair region and natural blending with the background.

  • We design modulation modules based on attention mechanisms to divide the latent space feature vectors into coarse, medium, and fine levels. The proposed algorithm achieves cross-channel fusion of feature vectors and realizes accurate and detailed hairstyle and hair color modification from multiple reference images.

2 Related work

2.1 GAN-based face image editing algorithms

Since the introduction of GAN models, the quality and precision of generated facial images have significantly improved. In 2014, Mirza et al. [14] proposed the conditional GAN (CGAN), which replaced the random noise input with a conditional input, enabling controlled facial image generation. In 2017, Isola et al. [7] introduced the pix2pix model based on the concept of CGAN, which can transfer the features of a reference image to a target image. Furthermore, Zhu et al. [29] developed the CycleGAN model based on the pix2pix framework and achieved cross-domain facial attribute transfer through a cyclic generation network. To generate higher resolution images, Wang et al. [23] proposed pix2pixHD by refining the pix2pix model. Park et al. [15] presented a Spatially-Adaptive Denormalization (SPADE) network that utilizes spatially adaptive normalization modules to generate realistic images from user-specified semantic masks. Moreover, MaskGAN [13] focused on facial image generation based on fine-grained semantic masks, enabling interactive facial image editing. Although CGAN-based face editing algorithms can partially perform hairstyle and hair color editing, the quality of the generated images is unsatisfactory. With the development of the StyleGAN network [9], the quality of generated faces has greatly improved, and downstream tasks that edit faces by modifying the latent space of a pre-trained StyleGAN have been extensively explored. Based on a pre-trained StyleGAN, existing approaches map the input conditions and noise into different levels of the latent space to realize generation diversity [4, 6, 19, 26]. As a special case of local face editing, hair editing can also be achieved by manipulating the feature vector in the StyleGAN latent space.

2.2 Hair editing algorithms

For more challenging attribute control in hair editing tasks, Tan et al. [20] introduced the CGAN-based MichiGAN, which uses multiple conditional inputs to achieve hair transfer. The hair information is decomposed into four attributes as input: shape, structure, appearance, and background. However, the coupling between facial attributes results in unnatural generation when arbitrary hairstyle modifications are performed. Saha et al. [18] proposed a two-stage hair transfer framework called LOHO, which achieves hairstyle transfer by optimizing the latent feature space of StyleGAN2 [10]. However, the generated hair does not fit the face contour and image background well. Treating hair editing as an attribute transfer task, Wei et al. [24] proposed HairCLIP to modify the hairstyle and hair color. This approach introduces the CLIP model [17] to bridge text and image features: the guiding hairstyle is provided as an image or a text description, and a modulation mechanism for style fusion realizes the hair modification. Due to the coarse feature representation ability of text descriptions, this method can only roughly constrain the hair change. Furthermore, Zhu et al. [30] utilized a semantic mask as a hairstyle reference in their Barbershop model. They expanded the embedded \(W^+\) latent space into a proposed FS space, encoded more fine-grained spatial information, and further improved the visual quality of the outputs. Moreover, Kim et al. [11] proposed the StyleYourHair model based on Barbershop [30]; spatial alignment of the hair and face is achieved by introducing a local style matching loss during network training.

Inspired by the modulation concept proposed by HairCLIP [24], we design a personalized hairstyle and hair color editing system based on a pre-trained StyleGAN model. Additionally, we design a multi-feature extraction and fusion framework that incorporates both the hairstyle and color features from different reference images, to accurately control the editing of hairstyle and hair color. The proposed method achieves natural hair blending effects with the input face and background.

3 Proposed method

Based on the idea of modifying the hairstyle with a semantic facial mask and guiding the hair color transfer with the hair of a reference image while retaining the other features of the original face, this paper proposes a hairstyle and hair color editing model based on multi-feature fusion and modulation. The framework consists of three main parts, as shown in Fig. 2.

Firstly, the multi-modal feature extraction module with three encoders is built, which includes a face encoder to preserve the facial features of the original face image, an encoder to extract features from the hair shape mask based on semantic segmentation, and an encoder to extract hair color information from the hair color reference image. Mapping the three types of features into the same latent space helps in adjusting and combining the hairstyle, hair color, and original facial features according to the user's needs. Secondly, the attention-based latent space feature decoupling and fusion module is constructed. Following the hierarchical feature division of the StyleGAN latent space, hairstyle and hair color are decoupled from the other facial features at coarse, medium, and fine levels of granularity. By adding an attention mechanism to the modulation process, accurate extraction and mixing of the independent hair features are achieved. Finally, the modulated hybrid latent space feature vector is used as the regeneration input, and a pre-trained StyleGAN2 model is employed to regenerate face images with the desired hairstyle and color.
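To make the data flow concrete, the following sketch wires the three stages together at a high level. The encoder, modulation, and generator classes are placeholders standing in for the components described above, not the released implementation.

```python
import torch.nn as nn

class HairEditingPipeline(nn.Module):
    """High-level forward pass: three e4e-style encoders produce W+ codes,
    modulation modules fuse them, and a frozen StyleGAN2 generator decodes."""
    def __init__(self, face_enc, mask_enc, color_enc, modulation, generator):
        super().__init__()
        self.face_enc, self.mask_enc, self.color_enc = face_enc, mask_enc, color_enc
        self.modulation = modulation          # attention-based fusion modules
        self.generator = generator.eval()     # pre-trained StyleGAN2, kept frozen
        for p in self.generator.parameters():
            p.requires_grad_(False)

    def forward(self, face_img, hair_mask, color_img):
        w = self.face_enc(face_img)            # (batch, 18, 512) original code
        w_s = self.mask_enc(hair_mask)         # hairstyle reference code
        w_c = self.color_enc(color_img)        # hair color reference code
        w_edit = self.modulation(w, w_s, w_c)  # fused latent code
        return self.generator(w_edit)
```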

Fig. 3 Augmentation results of rare hair colors

3.1 Construction of dataset

A training dataset based on the CelebAMask-HQ [8] dataset is constructed. CelebAMask-HQ consists of more than 30,000 face images with 512\(\times \)512 resolution. Each image in the dataset has a matching mask that contains 19 facial semantic categories (e.g., hair, facial features, and background). Additionally, the CelebAMask-HQ dataset contains only common hair colors (i.e., blonde, brown, and black), and there are few available images with other hair colors.

The data required for training can be classified into three types: the original input face image, the hairstyle reference mask image, and the hair color reference image, as shown in Fig. 2. From the CelebAMask-HQ dataset, we select 20,000 images as original input face images. Furthermore, we collect 1000 annotated segmentation masks as hairstyle reference mask images. Moreover, we collect images of 11 hair colors from the Internet, such as gold, brown, blue, purple, red, pink, and white, and then augment them as shown in Fig. 3 to overcome the limited availability of rare hair colors. Besides, 3000 facial images with different hair colors are used as hair color reference images in this paper.

The hair color expansion is implemented based on the idea of style transfer: an encoder-decoder architecture uses the semantic facial mask to constrain the hairstyle generation and the hair color reference image to constrain the hair color generation. The encoder uses ResNet50 [3] as the backbone network, extracts multi-layer features based on a pyramid structure [12], and stacks them to obtain color-related style feature vectors. The SPADE module [15] is introduced into the ResNet50 network, and the hairstyle mask image is fused into each SPADE block as shape guidance to obtain feature vectors containing hairstyle information. The decoder uses a pre-trained StyleGAN2 generator to decode the combined feature vector obtained by fusing the color and style control feature vectors, finally generating a new face image with the desired hairstyle and hair color.
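As an illustration of how the mask can guide each block, below is a minimal sketch of a SPADE-style normalization block as it might be inserted into the ResNet50 encoder. The hidden width and the single-channel mask input are assumptions for this sketch.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Minimal SPADE-style block: features are normalized and then re-modulated
    by gamma/beta maps predicted from a resized hairstyle mask."""
    def __init__(self, feat_channels, mask_channels=1, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat, mask):
        # Resize the mask to the spatial size of the current feature map
        mask = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
        h = self.shared(mask)
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)
```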

3.2 Multi-modal feature extraction

In this subsection, we aim to accurately extract the multiple features that guide the generation of new hair. We propose a group of three parallel encoders: an original input encoder that keeps most of the face features and prevents the regions irrelevant to the hair from being modified, a shape-constrained hairstyle encoder that uses a hairstyle mask as the reference, and a hair color encoder that migrates the hair color from a reference face to the original input face image. We employ the encoder4editing (e4e) [21] network as the basic encoder structure, which performs well in GAN inversion tasks. More specifically, the e4e-based encoders generalize the \(W^+\) space to the \(W^k_*\) space of StyleGAN, where \(k=18\) denotes the number of output latent vectors. Based on the StyleGAN hierarchical levels, the latent space feature vector w (\(w_1 \sim w_{18} \in \mathbb {R}^{512}\)) is divided into the sub-vectors \(w_1 \sim w_3\), \(w_4 \sim w_7\), and \(w_8 \sim w_{18}\), which represent the coarse level features \(w_c\), medium level features \(w_m\), and fine level features \(w_f\), respectively.
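To make the split concrete, the sketch below partitions an 18\(\times \)512 latent code into the three sub-vectors described above; the tensor layout (batch, 18, 512) follows the common e4e/StyleGAN2 convention and is an assumption here.

```python
import torch

def split_latent(w_plus: torch.Tensor):
    """Split an extended latent code of shape (batch, 18, 512) into
    coarse (w1-w3), medium (w4-w7), and fine (w8-w18) sub-vectors."""
    assert w_plus.shape[1:] == (18, 512)
    w_c = w_plus[:, 0:3]    # coarse: layers 1-3, overall geometry
    w_m = w_plus[:, 3:7]    # medium: layers 4-7, finer structure
    w_f = w_plus[:, 7:18]   # fine: layers 8-18, color and texture
    return w_c, w_m, w_f
```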

The original input encoder and the hair color encoder operate in the real image domain; therefore, a pre-trained e4e encoder is used for both. For the hairstyle encoder, as the semantic facial mask used as its input belongs to a non-real image domain, it is necessary to retrain it so that all three encoders are aligned in the same feature space. The hairstyle encoder is retrained on the Flickr-Faces-HQ (FFHQ) dataset [9]. Generation results with the facial mask and noise as inputs are shown in Fig. 4. Overall, we can see that the face shape information is maintained while diverse generation results are obtained.

Fig. 4 Face generation results of the mask encoder. Noise perturbations are combined with the guiding mask (first column) as the generation constraints, therefore, both shape conformity and generation diversity are realized

3.3 Attention-based latent space feature decoupling and fusion

Following the idea of CGAN, the attribute editing effect can be realized by modifying the latent space feature vector of the face image. However, due to the entanglement of facial features, a clean modification of the hair area can hardly be achieved by simple interpolation or optimization of the latent vectors. To solve this issue, we propose a hierarchical multi-feature modulation mechanism for accurate hair editing, which effectively weaves the hairstyle and color constraints into the facial regeneration procedure.

Fig. 5 Per-level structure of the modulation module \(M_c\), \(M_m\), and \(M_f\)

The modulation step has three parallel modules, \(M_c\), \(M_m\), and \(M_f\), which output a control vector \(\varDelta w'\) after fusing the original face image, the hairstyle reference mask image, and the hair color reference image, as shown in Fig. 5. An ECA module is added to the modulation module proposed by HairCLIP [24] to fine-tune the weights of the feature vectors at each level for better fusion. The coarse level latent space sub-vector of the original face image \(w_c\) and that of the style reference mask image \(w^s_c\) are combined by the modulation module \(M_c\) to obtain a new sub-vector \(\varDelta w_c\); likewise, the medium level sub-vectors \(w_m\) and \(w^s_m\) are combined by \(M_m\) to obtain \(\varDelta w_m\). The hairstyle revision is mainly controlled by \(\varDelta w_c\) and \(\varDelta w_m\). The fine level latent space sub-vector of the original face image \(w_f\) and that of the color reference image \(w^c_f\) are combined by \(M_f\) to obtain \(\varDelta w_f\), which controls the hair color change. The final control vector \(\varDelta w'\) is computed by concatenating the modulated sub-vectors \(\varDelta w_c\), \(\varDelta w_m\), and \(\varDelta w_f\) through \(F_{cat}()\), as expressed in Eqs. (1)–(4):

$$\begin{aligned} \varDelta w_c=\lambda _c\times w_c+(1-\lambda _c)\times w_c^s \end{aligned}$$
(1)
$$\begin{aligned} \varDelta w_m=\lambda _m\times w_m+(1-\lambda _m)\times w_m^s \end{aligned}$$
(2)
$$\begin{aligned} \varDelta w_f=\lambda _f\times w_f+(1-\lambda _f)\times w_f^c \end{aligned}$$
(3)
$$\begin{aligned} \varDelta w'=F_{cat}(\varDelta w_c,\varDelta w_m,\varDelta w_f) \end{aligned}$$
(4)

where the coefficients \(\lambda _c\), \(\lambda _m\), and \(\lambda _f\) adjust the control strength of the coarse, medium, and fine levels, respectively. The ultimate latent space feature vector of the modified face image \(w'\) is calculated by element-wise addition of the final control vector \(\varDelta w'\) and the latent vector of the original face w, followed by normalization of the result.
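As an illustration, the following sketch combines the per-level blending of Eqs. (1)–(4) with the final addition and normalization step. The default blending weights are the values reported in Section 4.1, and the per-layer L2 normalization is an assumption for this sketch rather than the exact training-time implementation.

```python
import torch

def modulate_latents(w, w_style, w_color, lam_c=0.7, lam_m=0.9, lam_f=0.2):
    """Blend per-level sub-vectors (Eqs. 1-3), concatenate them (Eq. 4),
    and add the control vector to the original latent code.
    All latent codes have shape (batch, 18, 512)."""
    dw_c = lam_c * w[:, 0:3] + (1 - lam_c) * w_style[:, 0:3]    # Eq. (1)
    dw_m = lam_m * w[:, 3:7] + (1 - lam_m) * w_style[:, 3:7]    # Eq. (2)
    dw_f = lam_f * w[:, 7:18] + (1 - lam_f) * w_color[:, 7:18]  # Eq. (3)
    dw = torch.cat([dw_c, dw_m, dw_f], dim=1)                   # Eq. (4)
    w_prime = w + dw
    # Normalization step, assumed here to be an L2 normalization per layer
    return w_prime / (w_prime.norm(dim=-1, keepdim=True) + 1e-8)
```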

As shown in Fig. 5, each modulation module is built from basic blocks, each consisting of a fully connected layer, a modulator, a LeakyReLU layer, and an ECA attention layer [22]. \(M_c\) and \(M_m\) have five basic blocks, and \(M_f\) has ten basic blocks. The calculation process of the modulator can be expressed as in Eq. (5):

$$\begin{aligned} x^{\prime }=(1+f_\gamma (e))\frac{x-\mu _x}{\sigma _x}+f_\beta (e) \end{aligned}$$
(5)

where x represents the output of the previous layer (the input to the current modulator), \(x'\) represents the output of the current modulator, and e denotes the conditioning latent vector: the latent vector of the hairstyle reference mask image \(w^s\) in the hairstyle branch, or that of the hair color reference image \(w^c\) in the color branch. \(\mu _x\) and \(\sigma _x\) represent the mean and standard deviation of the input to the corresponding modulation layer, respectively, used to perform the normalization operation. \(f_{\beta }\) and \(f_{\gamma }\) are fully connected layers. Independent modification of the hairstyle or hair color can be realized by setting \(w^c\) or \(w^s\) to null and disabling the corresponding modulator in each basic block.
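The sketch below shows one possible realization of the modulator in Eq. (5), conditioning the normalized activation on a reference embedding e. The 512-dimensional layer width and the use of per-vector statistics are assumptions.

```python
import torch.nn as nn

class Modulator(nn.Module):
    """Modulator of Eq. (5): normalize x with its own statistics, then scale
    and shift it with gamma/beta predicted from the reference code e."""
    def __init__(self, dim=512):
        super().__init__()
        self.f_gamma = nn.Linear(dim, dim)
        self.f_beta = nn.Linear(dim, dim)

    def forward(self, x, e=None):
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True) + 1e-8
        x_hat = (x - mu) / sigma
        if e is None:  # modulator disabled: plain normalization only
            return x_hat
        return (1 + self.f_gamma(e)) * x_hat + self.f_beta(e)
```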

Fig. 6 Structure of the ECA attention model with a convolution kernel size of 5

To better integrate these features, Efficient Channel Attention (ECA) [22] is introduced into the modulation module. The structure of ECA is shown in Fig. 6. To ensure that the attention module is dimensionally matched to the input features, an adaptive convolution kernel is adopted to learn the importance of different channels: a larger convolution kernel is used in layers with more channels to allow more cross-channel feature fusion, while a smaller convolution kernel is used in layers with fewer channels. The adaptive kernel size is defined in Eq. (6):

$$\begin{aligned} k=\psi (C)=|\frac{\log _2(C)}{\gamma }+\frac{b}{\gamma }|_{odd} \end{aligned}$$
(6)

where k denotes the convolution kernel size, C denotes the channel dimension, and \(|z|_{odd}\) denotes the nearest odd number to z. In this paper, the variables \(\gamma \) and b, which control the mapping between the channel number C and the convolution kernel size k, are set to 2 and 1, respectively.
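For reference, a compact ECA layer following Eq. (6) might look as follows; it mirrors the standard ECA formulation applied here to a 1-D latent vector treated as channels, and is not the authors' exact code. For C = 512, \(\gamma = 2\), and b = 1, the kernel size evaluates to 5, matching Fig. 6.

```python
import math
import torch.nn as nn

class ECA1d(nn.Module):
    """ECA on a per-level latent vector: a 1-D convolution over channels
    produces per-channel attention weights; kernel size follows Eq. (6)."""
    def __init__(self, channels=512, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1  # round to an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels); no spatial pooling is needed for 1-D latents
        w = self.sigmoid(self.conv(x.unsqueeze(1)).squeeze(1))
        return x * w
```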

Fig. 7 Our method is able to modify the hair color and hairstyle according to the reference images and generate realistic editing results. In each example, the large image on the left is the generated image, and the inset images on the right represent the target hair attributes in the order of: input face image, hairstyle reference mask, and hair color reference image. The proposed method can change hair style and color simultaneously or independently

3.4 Loss function

The proposed framework consists of three encoders, three feature modulation modules connected to them, and a pre-trained StyleGAN generator. The three encoders are pre-trained as described in Subsection 3.2. The three feature modulation modules are trained with five loss functions: the hairstyle constraint loss \(L_{hairstyle}\), background loss \(L_{bg}\), hair color loss \(L_{color}\), normalization loss \(L_{norm}\), and identity preservation loss \(L_{id}\).

3.4.1 Hairstyle constraint loss

To generate a new face with the desired hairstyle, the target hair shape is taken from the hair region of the hairstyle reference mask image \(I_{mask}\), while the hair region of the generated face \(I_{new}\) is obtained by a pre-trained face segmentation network [28]. \(D_{hair}()\) is the density distribution map that indicates the probability of each pixel belonging to the hair region. \(L_{hairstyle}\) measures the pixel-wise coincidence between \(I_{mask}\) and \(I_{new}\), as defined in Eq. (7):

$$\begin{aligned} \begin{aligned}L_{hairstyle} = || D_{hair}({I_{new}}) \times ( 1 - I_{mask})\\ - D_{hair}({I_{new}}) \times {I_{mask}}||_2\end{aligned} \end{aligned}$$
(7)

where \({I_{mask}}\) denotes the hair region of the hairstyle reference mask image and \(D_{hair}({I_{new}})\) denotes the density distribution of the hair region in the generated face image. The loss is computed as an L2-norm; a low value of \(L_{hairstyle}\) means a small difference between the hair shapes of the reference and modified images.
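A possible implementation of Eq. (7), assuming the hair density map is the soft hair-class probability produced by the segmentation network and the reference mask is binary:

```python
import torch

def hairstyle_loss(hair_density_new: torch.Tensor, mask_ref: torch.Tensor):
    """Eq. (7), read literally: penalize hair probability outside the reference
    hair mask against the probability inside it.
    Both tensors have shape (batch, 1, H, W) with values in [0, 1]."""
    outside = hair_density_new * (1.0 - mask_ref)
    inside = hair_density_new * mask_ref
    return torch.norm(outside - inside, p=2)
```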

3.4.2 Background loss

The background is defined as the whole image minus the segmented hair region. The intersection of the backgrounds of the original face image and the generated face image is computed and denoted \(B_{ori}\) for the original image \(I_{ori}\) and \(B_{new}\) for the generated image \(I_{new}\), respectively. The background loss \(L_{bg}\) is designed to keep the regions irrelevant to the hair unchanged, as shown in Eq. (8):

$$\begin{aligned} L_{bg}=|| B_{ori} - B_{new}||_2 \end{aligned}$$
(8)

3.4.3 Hair color loss

The respective hair regions of the original face image and the generated face image are computed and denoted \(H_{ori}\) for the original image and \(H_{new}\) for the generated image, respectively. The hair color loss \(L_{color}\) is defined to measure the difference between the hair color of the reference image and that of the generated image by computing the pixel-wise L2-norm:

$$\begin{aligned} L_{color}=|| H_{ori} - H_{new}||_2 \end{aligned}$$
(9)

3.4.4 Normalization loss

Since the hair color is influenced by the fine level latent feature sub-vector, while the hairstyle-related geometric structure is mainly related to the coarse level sub-vector, we apply a normalization loss to restrict feature changes at the medium level of the latent space, as described in Eq. (10):

$$\begin{aligned} L_{norm}=||\varDelta w^{ori}_m - \varDelta w^{new}_m||_2 \end{aligned}$$
(10)

where \(\varDelta w^{ori}_m\) is the medium level latent feature sub-vector of the original face image, \(\varDelta w^{new}_m\) is the medium level latent feature sub-vector of the generated face image.

Fig. 8 Independent hair color transfer samples for male and female

Fig. 9 Independent hairstyle transfer samples for male and female

3.4.5 Identity preservation loss

To further preserve the original face features, we focus on the preservation of the identity features. By utilizing a pre-trained ArcFace [2] face recognition network, 50-dimensional identity feature vectors are extracted from both the original face and the generated face. The cosine similarity cos() is employed to measure the similarity between two identity features x and y. The identity preservation loss \(L_{id}\) is defined as follows:

$$\begin{aligned} L_{id}=1-cos\left( R(x),R(y)\right) \end{aligned}$$
(11)

where R() is the identity feature extraction function. In summary, the overall loss function is defined as follows:

$$\begin{aligned} L&= \lambda _{hairstyle}L_{hairstyle} +\lambda _{color}L_{color}\nonumber \\&\quad +\lambda _{id}L_{id} +\lambda _{bg}L_{bg}+\lambda _{norm}L_{norm} \end{aligned}$$
(12)

where \(\lambda _{hairstyle}\), \(\lambda _{color}\), \(\lambda _{id}\), \(\lambda _{bg}\), and \(\lambda _{norm}\) are hyperparameters. Combining these loss functions with carefully tuned hyperparameters allows us to strike a balance between preserving the facial features and achieving the desired degree of hair modification.
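Putting the pieces together, a sketch of the overall objective in Eq. (12) could look as follows. The masking and feature-extraction helpers (segmentation masks, identity embeddings) are assumed to be computed elsewhere, and the default weights are the values reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def total_loss(hair_density_new, mask_ref,   # for L_hairstyle, Eq. (7)
               bg_ori, bg_new,               # for L_bg, Eq. (8)
               hair_ori, hair_new,           # for L_color, Eq. (9)
               dw_m_ori, dw_m_new,           # for L_norm, Eq. (10)
               id_ori, id_new,               # for L_id, Eq. (11)
               lam=(1.0, 0.01, 0.4, 1.0, 0.8)):  # (hairstyle, color, id, bg, norm)
    l_hairstyle = torch.norm(
        hair_density_new * (1 - mask_ref) - hair_density_new * mask_ref, p=2)
    l_bg = torch.norm(bg_ori - bg_new, p=2)
    l_color = torch.norm(hair_ori - hair_new, p=2)
    l_norm = torch.norm(dw_m_ori - dw_m_new, p=2)
    l_id = 1 - F.cosine_similarity(id_ori, id_new, dim=-1).mean()
    w_hs, w_col, w_id, w_bg, w_nrm = lam
    return (w_hs * l_hairstyle + w_col * l_color + w_id * l_id
            + w_bg * l_bg + w_nrm * l_norm)
```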

4 Experiments and analysis

To validate the effectiveness of the proposed model, qualitative and quantitative experiments are conducted. Furthermore, we compare our hair design system against state-of-the-art (SOTA) approaches. Moreover, we perform comprehensive ablation experiments to verify the necessity and contribution of each loss function within the proposed framework.

4.1 Implementation details

We conduct all the experiments on the extended CelebAMask-HQ dataset introduced in Subsection 3.1. The proposed framework is trained on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. During training, the Ranger [25] optimizer is used with an initial learning rate of 0.1, which decays by a factor of 0.95 every 10,000 training iterations. The StyleGAN2 [10] generator pre-trained on the FFHQ dataset [9] is used as our generator. For the loss hyperparameters, we set \(\lambda _{hairstyle}=1\), \(\lambda _{color}=0.01\), \(\lambda _{id}=0.4\), \(\lambda _{bg}=1\), and \(\lambda _{norm}=0.8\). For the hyperparameters of the modulation modules, we set \(\lambda _c=0.7\), \(\lambda _m=0.9\), and \(\lambda _f= 0.2\).
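The schedule above might be set up as in the following sketch. Adam and StepLR stand in for the Ranger optimizer, whose exact import path depends on the third-party package used, so treat these names as assumptions.

```python
import torch

def build_optimizer(modulation_params):
    """Optimizer and schedule matching Section 4.1: initial learning rate 0.1,
    decayed by a factor of 0.95 every 10,000 iterations."""
    # Adam is used here only as a stand-in for the Ranger optimizer.
    opt = torch.optim.Adam(modulation_params, lr=0.1)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10_000, gamma=0.95)
    return opt, sched

# Typical usage inside the training loop:
#   loss.backward(); opt.step(); opt.zero_grad(); sched.step()
```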

Fig. 10 The effects of changing from long hair to short hair, and from short hair to long hair based on reference masks

4.2 Qualitative and quantitative evaluations

This research focuses on the accurate transfer of hair attributes while ensuring natural blending of the hair with the face and background. Therefore, we conduct a series of experiments to show the efficacy of the proposed model in modifying hair color and style. Hairstyle and color design results for males are given in the left half of Fig. 7, and results for females in the right half. Although the reference hair color and hairstyle images may differ from the original face (the top right image in each example) in gender and key facial attributes, the generated face image (the large image in each example) achieves accurate hair modification while maintaining stable identity information. Due to the hairstyle and face contour differences between the input face and the reference face, either a strict or an adaptive hairstyle transfer effect can be achieved by adjusting the hyperparameters of the loss functions. In the current implementation, stringent hair shape transfer is employed, which may sometimes result in a less natural blending effect, as shown in Fig. 1.

The proposed method can modify the hairstyle and hair color independently or simultaneously. Experimental results of changing only the hair color are given in Fig. 8: by training on the augmented hair dataset, the proposed method correctly migrates rare colors and produces natural hair texture. Experimental results of changing only the hairstyle are given in Fig. 9; the regenerated hair fits the face contour well and blends coherently with the background. Furthermore, we conduct experiments on modifying hairstyles with different hair lengths, as shown in Fig. 10.

Fig. 11 Hair length shortening results. The hair blends naturally with the background as indicated by the red dashed boxes

In this experiment, the mask encoder efficiently extracts shape features, and the coarse and medium level vectors are modified to constrain the modification scope. Consequently, when shortening the hair, the modified hair region blends well with the background, as indicated by the red dashed boxes in Fig. 11.

To quantitatively investigate whether the proposed method accurately learns the reference hairstyle, we compare the similarity between the hair regions of the output face image \(m_1\) and the input face image \(m_2\) using the intersection over union measure \(M_{IoU}\) defined in Eq. (13):

$$\begin{aligned} M_{IoU}=1-\frac{1}{N}\sum _{i=1}^{N}\Vert m_1-m_2\Vert _1 \end{aligned}$$
(13)

The pre-trained face parsing network BiSeNet [28] is employed to extract the hair mask region of a face image, and the pixel-wise L1-norm between the two hair mask regions is calculated. We randomly sample two face images and compute their hair masks using different approaches, as shown in Fig. 12. The corresponding \(M_{IoU}\) values of the manually labeled mask \(m_{Manu}\) and the BiSeNet allocated mask \(m_{Ref}\), each compared to the BiSeNet allocated output mask \(m_{Output}\), are listed in Table 1. We can see that the hair shape and the \(M_{IoU}\) value of the output image are quite close to those of the reference image when the same parsing network is used.
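A direct reading of Eq. (13) as code, assuming the two hair masks are binary tensors of identical shape and N is the number of pixels:

```python
import torch

def m_iou(m1: torch.Tensor, m2: torch.Tensor) -> float:
    """Eq. (13): one minus the mean pixel-wise L1 difference between two
    binary hair masks with values in {0, 1}."""
    assert m1.shape == m2.shape
    n = m1.numel()
    return 1.0 - (m1 - m2).abs().sum().item() / n
```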

Fig. 12 Hair segmentation results of the input face image and generated face image. The hair masks are allocated by manual extraction or BiSeNet segmentation network

Table 1 Similarity between the manually labeled reference mask \(m_{Manu}\) and the BiSeNet allocated reference mask \(m_{Ref}\) compared to the BiSeNet allocated output mask \(m_{Output}\) is calculated
Fig. 13 Visual comparison with MichiGAN [20], LOHO [18], HairCLIP [24], Barbershop [30], and StyleYourHair [11]. Our method demonstrates accurate hair transfer, stable preservation of the irrelevant features, and natural blending effects

Currently, the \(M_{IoU}\) value of the manually labeled hair mask and output hair mask is low. This is probably because the hair mask predicted by the network loses details such as bangs. In the future, a more accurate facial segmentation network will be considered to improve the above \(M_{IoU}\) value and achieve more precise hairstyle modifications.

4.3 Comparison with state-of-the-art methods

To evaluate the effectiveness of the proposed model, we compare it with state-of-the-art (SOTA) methods, as shown in Fig. 13. We use face images with different hairstyle and hair color inputs (rows 1 to 5) and face images with the same hairstyle and hair color input (row 6), covering different genders and skin colors. In the testing stage, we use the authors' official open-source code and the released pre-trained models for comparison. As the image inpainting algorithms of LOHO and MichiGAN are not released, we use the image inpainting tool Generative Fill [1] instead.

Experimental results are given in Fig. 13. The proposed method achieves accurate hair color transfer with different hairstyle and hair color references. Through hierarchical feature modulation, the resulting hair blends naturally with the background while irrelevant attributes remain untouched. MichiGAN copies the hair shape from the reference image and renders it with the reference color; irrelevant areas are effectively preserved, but unnatural background blending occurs. LOHO reconstructs hair by optimizing latent vectors, and there is an obvious artifact around the face contour. The faces generated by HairCLIP have good texture detail; however, the hairstyle deviates from the reference shape due to its coarse control. Barbershop can generate hair similar to the reference style and color, but the hair does not align well with the face. StyleYourHair can modify the hairstyle according to the shape reference image but fails to incorporate a user-specified hair color from another image.

The results of the quantitative comparison with other methods are presented in Table 2. We choose Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Fréchet Inception Distance (FID), and Identity Similarity (IDS) computed by CurricularFace [5] as metrics to evaluate feature preservation and image quality. As in other existing algorithms, PSNR and SSIM are calculated on the face region, while IDS and FID are calculated on the whole image. When only the hair color is modified, the proposed method outperforms the others on the metrics shown in Table 2; when both the hairstyle and color are modified, the proposed method achieves the best results under the SSIM and FID metrics. We also compare the efficiency of the methods in Table 3, and our approach transfers hair attributes faster than the SOTA methods.

Table 2 Quantitative comparison regarding the preservation of face features and image quality
Table 3 Comparison of the time to generate a face image with desired hairstyle and hair color
Fig. 14 Visual ablation analysis of the proposed method

Table 4 Quantitative evaluation of ablation analysis of our method

4.4 Ablation analysis

To verify the effectiveness of the proposed network structure and loss functions, we alternately ablate each key component, as shown in Fig. 14. Removing the background loss \(L_{bg}\) (third column) not only changes the background area but also causes a severe shift of the overall color tone. Removing the identity loss \(L_{id}\) (fourth column) leads to a slight change in the facial features, reflected by the mouth change in this sample. When the normalization loss \(L_{norm}\) (fifth column) is removed, unnatural artifacts appear in the generated face image. By comparing the results with (second column) and without the ECA module (sixth column), we can see that the texture details are more natural when the ECA module is incorporated. Furthermore, the effect of each loss function on maintaining facial identity and visual quality is reported in Table 4. In conclusion, we can infer that without the background loss \(L_{bg}\) or the normalization loss \(L_{norm}\), both the image quality and identity recognition are reduced.

5 Conclusion

In this paper, we propose a hierarchical multi-feature guided hair design system. Accurate and flexible hairstyle and color changes are achieved through style and color reference images. Different losses are incorporated to constrain the modification to the hair region, preventing changes to important features such as face identity while keeping natural fusion with the background. By carefully adjusting the weights of the loss functions, the system can either generate a hairstyle and color closer to the reference image or better fit the input face. In the future, more accurate hairstyle transfer can be realized by optimizing the facial segmentation network.