1 Introduction

Fig. 1 Our method is able to modify the input face based on the reference hairstyle and hair color with fine hair detail and natural blending effects

In both real-life and virtual social settings, the demand for hairdressing is ubiquitous. For example, when a person finds a favorite hairstyle and hair color but is unsure whether it suits his or her appearance, he or she may resort to a virtual hair design and recommendation system, which requires only a frontal facial photograph as input and lets the user select a suitable hairstyle and color from a set of generated portraits.

In recent years, many deep learning-based methods have explored hairstyle editing tasks. Some methods use sketches [27] or masks [11, 18, 20, 30] as constraints or references to transfer a hairstyle from one image to another. However, the regenerated hair may not naturally fit the face or background, because the hairstyle or color is strictly migrated without proper adaptation. Text-based methods [16, 24], which extract features from both text and images using a pre-trained Contrastive Language-Image Pre-training (CLIP) [17] encoder, are usually user-friendly, but precise control over color and shape is hard to achieve due to the coarse-grained features obtained from text descriptions. In general, most existing methods perform well on hair attribute editing, but achieving personalized hair modification while keeping irrelevant facial features untouched remains a challenging and important task.

A user-provided hair mask serves as the hairstyle modification constraint, while another face image with the desired hair color serves as the hair color modification constraint. To achieve precise hair editing, hair attributes are decoupled into hairstyle and hair color in the latent space. The feature vectors are divided into three hierarchical levels, namely coarse, medium, and fine, as proposed by StyleCLIP [16]. The features of the reference hairstyle and hair color are fused into the latent vectors so that the modified hair better fits the face shape, the background, and the other facial regions. More specifically, we construct three encoders to extract features from the multiple input images and combine the different levels of features with the proposed modulation modules. By feeding the resulting feature vector into a pre-trained StyleGAN [9], a new face image with the expected hair attributes is generated. Some editing examples are provided in Fig. 1.

Fig. 2 The proposed framework. Our framework consists of the feature extraction module, feature fusion module, and generation module. The features of the original face image, hairstyle mask, and hair color guide image are extracted by e4e-based encoders. The extracted features are divided into three layers and combined by the modulation modules according to their granularities to form a latent space feature vector of the modified face. By employing a pre-trained StyleGAN generator, a face with expected hair attributes is generated

The main contributions of this work are summarized as follows:

  • We construct a hair editing framework consisting of three feature encoders, three latent space feature vector modulation modules, and a pre-trained StyleGAN generator. The modification is focused on hairstyle and hair color without affecting other facial attributes, and the desired hairstyle and hair color can be quickly transferred from reference images to the target face, either jointly or individually.

  • Background loss, hair color loss, normalization loss, identity loss, and hairstyle constraint loss functions are incorporated to keep a balance between the preservation of facial attributes and the modification of hair attributes, ensuring precise editing within the hair region and natural blending with the background.

  • We design modulation modules based on attention mechanisms to divide the latent space feature vectors into coarse, medium, and fine levels. The proposed algorithm achieves cross-channel fusion of feature vectors and realizes accurate and detailed hairstyle and hair color modification from multiple reference images.

2 Related work

2.1 GAN-based face image editing algorithms

Since the introduction of GAN models, the quality and precision of generated facial images have significantly improved. In 2014, Mirza et al. [14] proposed the conditional GAN (CGAN), which replaced the random noise input with a conditional input, enabling controlled facial image generation. In 2017, Isola et al. [7] introduced the pix2pix model based on the concept of CGAN, which can transfer the features of a reference image to a target image. Furthermore, Zhu et al. [29] developed the CycleGAN model based on the pix2pix framework and achieved cross-domain facial attribute transfer through a cyclic generation network. To generate higher resolution images, Wang et al. [23] proposed pix2pixHD by refining the pix2pix model. Park et al. [15] presented a Spatially-Adaptive Denormalization (SPADE) network that utilizes spatially adaptive normalization modules to generate realistic images from user-specified semantic masks. Moreover, MaskGAN [13] focused on facial image generation based on fine-grained semantic masks, enabling interactive facial image editing. Although CGAN-based face editing algorithms can partially perform hairstyle and hair color editing, the quality of the generated images is unsatisfactory. With the development of the StyleGAN network [9], the quality of generated faces has greatly improved, and downstream tasks that edit faces by modifying the latent space of a pre-trained StyleGAN have been extensively explored. Based on a pre-trained StyleGAN, existing approaches map the input conditions and noise into different levels of the latent space to realize generation diversity [4, 6, 19, 26]. As a special case of local face editing, hair editing can also be achieved by manipulating the feature vector in the StyleGAN latent space.

2.2 Hair editing algorithms

For more challenging attribute control in hair editing tasks, Tan et al. [20] introduced the CGAN-based MichiGAN, which uses multiple conditional inputs to achieve hair transfer. The hair information is decomposed into four attributes as input: shape, structure, appearance, and background. However, the coupling between facial attributes results in unnatural generation when arbitrary hairstyle modifications are performed. Saha et al. [18] proposed a two-stage hair transfer framework called LOHO, which achieves hairstyle transfer by optimizing the latent feature space of StyleGAN2 [10]. However, the generated hair does not fit the face contour and image background well. Treating hair editing as an attribute transfer task, Wei et al. [24] proposed HairCLIP to modify the hairstyle and hair color. This approach introduces the CLIP model [17] to bridge text and image features: the guiding hairstyle is provided as an image or a text description, and a modulation mechanism for style fusion realizes the hair modification. Due to the coarse feature representation ability of text descriptions, this method can only roughly constrain the hair change. Furthermore, Zhu et al. [30] utilized a semantic mask as a hairstyle reference in their Barbershop model. They expanded the embedded \(W^+\) latent space into a proposed FS space, encoded more fine-grained spatial information, and further improved the visual quality of the outputs. Moreover, Kim et al. [11] proposed the StyleYourHair model based on Barbershop [30]; spatial alignment of the hair and face is achieved by introducing a local style matching loss during network training.

Inspired by the modulation concept proposed by HairCLIP [24], we design a personalized hairstyle and hair color editing system based on a pre-trained StyleGAN model. Additionally, we design a multi-feature extraction and fusion framework that incorporates both the hairstyle and color features from different reference images, to accurately control the editing of hairstyle and hair color. The proposed method achieves natural hair blending effects with the input face and background.

3 Proposed method

Based on the idea of modifying the hairstyle with a semantic facial mask and guiding the hair color transfer with the hair of a reference image while retaining the other features of the original face, this paper proposes a hairstyle and hair color editing model based on multi-feature fusion and modulation. The framework consists of three main parts, as shown in Fig. 2.

Firstly, the multi-modal feature extraction module with three encoders is built, which includes a face encoder to preserve the facial features of the original face image, an encoder to extract features from the hair shape mask based on semantic segmentation, and an encoder to extract hair color information from the hair color reference image. Mapping the three types of features into the same latent space helps in adjusting and combining the hairstyle, hair color, and original facial features according to the user's needs. Secondly, the attention-based latent space feature decoupling and fusion module is constructed. Following the hierarchical feature division of the StyleGAN latent space, hairstyle and hair color are decoupled from the other facial features at coarse, medium, and fine levels of granularity. By adding an attention mechanism to the modulation process, accurate extraction and mixing of the independent hair features are achieved. Finally, the modulated hybrid latent space feature vector is used as the regeneration input, and a pre-trained StyleGAN2 model is employed to regenerate face images with the desired hairstyle and color.
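To make the data flow concrete, the following sketch wires the three stages together at a high level. The encoder, modulation, and generator classes are placeholders standing in for the components described above, not the released implementation.

```python
import torch.nn as nn

class HairEditingPipeline(nn.Module):
    """High-level forward pass: three e4e-style encoders produce W+ codes,
    modulation modules fuse them, and a frozen StyleGAN2 generator decodes."""
    def __init__(self, face_enc, mask_enc, color_enc, modulation, generator):
        super().__init__()
        self.face_enc, self.mask_enc, self.color_enc = face_enc, mask_enc, color_enc
        self.modulation = modulation          # attention-based fusion modules
        self.generator = generator.eval()     # pre-trained StyleGAN2, kept frozen
        for p in self.generator.parameters():
            p.requires_grad_(False)

    def forward(self, face_img, hair_mask, color_img):
        w = self.face_enc(face_img)            # (batch, 18, 512) original code
        w_s = self.mask_enc(hair_mask)         # hairstyle reference code
        w_c = self.color_enc(color_img)        # hair color reference code
        w_edit = self.modulation(w, w_s, w_c)  # fused latent code
        return self.generator(w_edit)
```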

Fig. 3 Augmentation results of rare hair colors

3.1 Construction of dataset

A training dataset based on the CelebAMask-HQ [8] dataset is constructed. CelebAMask-HQ consists of more than 30,000 face images with 512\(\times \)512 resolution. Each image in the dataset has a matching mask that contains 19 facial semantic categories (e.g., hair, facial features, and background). Additionally, the CelebAMask-HQ dataset contains only common hair colors (i.e., blonde, brown, and black), and there are few available images with other hair colors.

The data required for training can be classified into three types: the original input face image, the hairstyle reference mask image, and the hair color reference image, as shown in Fig. 2. From the CelebAMask-HQ dataset, we select 20,000 images as original input face images. Furthermore, we collect 1000 annotated segmentation masks as hairstyle reference mask images. Moreover, we collect images of 11 hair colors from the Internet, such as gold, brown, blue, purple, red, pink, and white, and then augment them as shown in Fig. 3 to overcome the limited availability of rare hair colors. Besides, 3000 facial images with different hair colors are used as hair color reference images in this paper.

The hair color expansion is implemented based on the idea of style transfer: an encoder-decoder architecture uses the semantic facial mask to constrain the hairstyle generation and the hair color reference image to constrain the hair color generation. The encoder uses ResNet50 [3] as the backbone network, extracts multi-layer features based on a pyramid structure [12], and stacks them to obtain color-related style feature vectors. The SPADE module [15] is introduced into the ResNet50 network, and the hairstyle mask image is fused into each SPADE block as shape guidance to obtain feature vectors containing hairstyle information. The decoder uses a pre-trained StyleGAN2 generator to decode the combined feature vector obtained by fusing the color and style control feature vectors, finally generating a new face image with the desired hairstyle and hair color.
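As an illustration of how the mask can guide each block, below is a minimal sketch of a SPADE-style normalization block as it might be inserted into the ResNet50 encoder. The hidden width and the single-channel mask input are assumptions for this sketch.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Minimal SPADE-style block: features are normalized and then re-modulated
    by gamma/beta maps predicted from a resized hairstyle mask."""
    def __init__(self, feat_channels, mask_channels=1, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat, mask):
        # Resize the mask to the spatial size of the current feature map
        mask = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
        h = self.shared(mask)
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)
```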

3.2 Multi-modal feature extraction

In this subsection, we aim to accurately extract the multiple features that guide the generation of new hair. We propose a group of three parallel encoders: an original input encoder that keeps most of the face features and prevents the regions irrelevant to the hair from being modified, a shape-constrained hairstyle encoder that uses a hairstyle mask as the reference, and a hair color encoder that migrates the hair color from a reference face to the original input face image. We employ the encoder4editing (e4e) [21] network as the basic encoder structure, which performs well in GAN inversion tasks. More specifically, the e4e-based encoders generalize the \(W^+\) space to the \(W^k_*\) space of StyleGAN, where \(k=18\) denotes the number of output latent vectors. Based on the StyleGAN hierarchical levels, the latent space feature vector w (\(w_1 \sim w_{18} \in \mathbb {R}^{512}\)) is divided into the sub-vectors \(w_1 \sim w_3\), \(w_4 \sim w_7\), and \(w_8 \sim w_{18}\), which represent the coarse level features \(w_c\), medium level features \(w_m\), and fine level features \(w_f\), respectively.
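To make the split concrete, the sketch below partitions an 18\(\times \)512 latent code into the three sub-vectors described above; the tensor layout (batch, 18, 512) follows the common e4e/StyleGAN2 convention and is an assumption here.

```python
import torch

def split_latent(w_plus: torch.Tensor):
    """Split an extended latent code of shape (batch, 18, 512) into
    coarse (w1-w3), medium (w4-w7), and fine (w8-w18) sub-vectors."""
    assert w_plus.shape[1:] == (18, 512)
    w_c = w_plus[:, 0:3]    # coarse: layers 1-3, overall geometry
    w_m = w_plus[:, 3:7]    # medium: layers 4-7, finer structure
    w_f = w_plus[:, 7:18]   # fine: layers 8-18, color and texture
    return w_c, w_m, w_f
```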

The original input encoder and the hair color encoder operate in the real image domain; therefore, a pre-trained e4e encoder is used for both. For the hairstyle encoder, as the semantic facial mask used as its input belongs to a non-real image domain, it is necessary to retrain it so that all three encoders are aligned in the same feature space. The hairstyle encoder is retrained on the Flickr-Faces-HQ (FFHQ) dataset [9]. Generation results with the facial mask and noise as inputs are shown in Fig. 4. Overall, we can see that the face shape information is maintained while diverse generation results are obtained.

Fig. 4 Face generation results of the mask encoder. Noise perturbations are combined with the guiding mask (first column) as the generation constraints, therefore, both shape conformity and generation diversity are realized

3.3 Attention-based latent space feature decoupling and fusion

Following the idea of CGAN, the attribute editing effect can be realized by modifying the latent space feature vector of the face image. However, due to the entanglement of facial features, a clean modification of the hair area can hardly be achieved by simple interpolation or optimization of the latent vectors. To solve this issue, we propose a hierarchical multi-feature modulation mechanism for accurate hair editing, which effectively weaves the hairstyle and color constraints into the facial regeneration procedure.

Fig. 5 Per-level structure of the modulation module \(M_c\), \(M_m\), and \(M_f\)

The modulation step has three parallel modules, \(M_c\), \(M_m\), and \(M_f\), which output a control vector \(\varDelta w'\) after fusing the original face image, the hairstyle reference mask image, and the hair color reference image, as shown in Fig. 5. An ECA module is added to the modulation module proposed by HairCLIP [24] to fine-tune the weights of the feature vectors at each level for better fusion. The coarse level latent space sub-vector of the original face image \(w_c\) and that of the style reference mask image \(w^s_c\) are combined by the modulation module \(M_c\) to obtain a new sub-vector \(\varDelta w_c\); likewise, the medium level sub-vectors \(w_m\) and \(w^s_m\) are combined by \(M_m\) to obtain \(\varDelta w_m\). The hairstyle revision is mainly controlled by \(\varDelta w_c\) and \(\varDelta w_m\). The fine level latent space sub-vector of the original face image \(w_f\) and that of the color reference image \(w^c_f\) are combined by \(M_f\) to obtain \(\varDelta w_f\), which controls the hair color change. The final control vector \(\varDelta w'\) is computed by concatenating the modulated sub-vectors \(\varDelta w_c\), \(\varDelta w_m\), and \(\varDelta w_f\) through \(F_{cat}()\), as expressed in Eqs. (1)–(4):

$$\begin{aligned} \varDelta w_c=\lambda _c\times w_c+(1-\lambda _c)\times w_c^s \end{aligned}$$
(1)
$$\begin{aligned} \varDelta w_m=\lambda _m\times w_m+(1-\lambda _m)\times w_m^s \end{aligned}$$
(2)
$$\begin{aligned} \varDelta w_f=\lambda _f\times w_f+(1-\lambda _f)\times w_f^c \end{aligned}$$
(3)
$$\begin{aligned} \varDelta w'=F_{cat}(\varDelta w_c,\varDelta w_m,\varDelta w_f) \end{aligned}$$
(4)

where the coefficients \(\lambda _c\), \(\lambda _m\), and \(\lambda _f\) adjust the control strength of the coarse, medium, and fine levels, respectively. The ultimate latent space feature vector of the modified face image \(w'\) is calculated by element-wise addition of the final control vector \(\varDelta w'\) and the latent vector of the original face w, followed by normalization of the result.
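As an illustration, the following sketch combines the per-level blending of Eqs. (1)–(4) with the final addition and normalization step. The default blending weights are the values reported in Section 4.1, and the per-layer L2 normalization is an assumption for this sketch rather than the exact training-time implementation.

```python
import torch

def modulate_latents(w, w_style, w_color, lam_c=0.7, lam_m=0.9, lam_f=0.2):
    """Blend per-level sub-vectors (Eqs. 1-3), concatenate them (Eq. 4),
    and add the control vector to the original latent code.
    All latent codes have shape (batch, 18, 512)."""
    dw_c = lam_c * w[:, 0:3] + (1 - lam_c) * w_style[:, 0:3]    # Eq. (1)
    dw_m = lam_m * w[:, 3:7] + (1 - lam_m) * w_style[:, 3:7]    # Eq. (2)
    dw_f = lam_f * w[:, 7:18] + (1 - lam_f) * w_color[:, 7:18]  # Eq. (3)
    dw = torch.cat([dw_c, dw_m, dw_f], dim=1)                   # Eq. (4)
    w_prime = w + dw
    # Normalization step, assumed here to be an L2 normalization per layer
    return w_prime / (w_prime.norm(dim=-1, keepdim=True) + 1e-8)
```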

As shown in Fig. 5, each modulation module is built from basic blocks, each consisting of a fully connected layer, a modulator, a LeakyReLU layer, and an ECA attention layer [22]. \(M_c\) and \(M_m\) have five basic blocks, and \(M_f\) has ten basic blocks. The calculation process of the modulator can be expressed as in Eq. (5):

$$\begin{aligned} x^{\prime }=(1+f_\gamma (e))\frac{x-\mu _x}{\sigma _x}+f_\beta (e) \end{aligned}$$
(5)

where x represents the output of the previous layer (the input to the current modulator), \(x'\) represents the output of the current modulator, and e denotes the conditioning latent vector: the latent vector of the hairstyle reference mask image \(w^s\) in the hairstyle branch, or that of the hair color reference image \(w^c\) in the color branch. \(\mu _x\) and \(\sigma _x\) represent the mean and standard deviation of the input to the corresponding modulation layer, respectively, used to perform the normalization operation. \(f_{\beta }\) and \(f_{\gamma }\) are fully connected layers. Independent modification of the hairstyle or hair color can be realized by setting \(w^c\) or \(w^s\) to null and disabling the corresponding modulator in each basic block.
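The sketch below shows one possible realization of the modulator in Eq. (5), conditioning the normalized activation on a reference embedding e. The 512-dimensional layer width and the use of per-vector statistics are assumptions.

```python
import torch.nn as nn

class Modulator(nn.Module):
    """Modulator of Eq. (5): normalize x with its own statistics, then scale
    and shift it with gamma/beta predicted from the reference code e."""
    def __init__(self, dim=512):
        super().__init__()
        self.f_gamma = nn.Linear(dim, dim)
        self.f_beta = nn.Linear(dim, dim)

    def forward(self, x, e=None):
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True) + 1e-8
        x_hat = (x - mu) / sigma
        if e is None:  # modulator disabled: plain normalization only
            return x_hat
        return (1 + self.f_gamma(e)) * x_hat + self.f_beta(e)
```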

Fig. 6 Structure of the ECA attention model with a convolution kernel size of 5

To better integrate these features, Efficient Channel Attention (ECA) [22] is introduced into the modulation module. The structure of ECA is shown in Fig. 6. To ensure that the attention module is dimensionally matched to the input features, an adaptive convolution kernel is adopted to learn the importance of different channels: a larger convolution kernel is used in layers with more channels to allow more cross-channel feature fusion, while a smaller convolution kernel is used in layers with fewer channels. The adaptive kernel size is defined in Eq. (6):

$$\begin{aligned} k=\psi (C)=|\frac{\log _2(C)}{\gamma }+\frac{b}{\gamma }|_{odd} \end{aligned}$$
(6)

where k denotes the convolution kernel size, C denotes the channel dimension, and \(|z|_{odd}\) denotes the nearest odd number to z. In this paper, the variables \(\gamma \) and b, which control the mapping between the channel number C and the convolution kernel size k, are set to 2 and 1, respectively.
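For reference, a compact ECA layer following Eq. (6) might look as follows; it mirrors the standard ECA formulation applied here to a 1-D latent vector treated as channels, and is not the authors' exact code. For C = 512, \(\gamma = 2\), and b = 1, the kernel size evaluates to 5, matching Fig. 6.

```python
import math
import torch.nn as nn

class ECA1d(nn.Module):
    """ECA on a per-level latent vector: a 1-D convolution over channels
    produces per-channel attention weights; kernel size follows Eq. (6)."""
    def __init__(self, channels=512, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1  # round to an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels); no spatial pooling is needed for 1-D latents
        w = self.sigmoid(self.conv(x.unsqueeze(1)).squeeze(1))
        return x * w
```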

Fig. 7 Our method is able to modify the hair color and hairstyle according to the reference images and generate realistic editing results. In each example, the large image on the left is the generated image, and the inset images on the right represent the target hair attributes in the order of: input face image, hairstyle reference mask, and hair color reference image. The proposed method can change hair style and color simultaneously or independently

3.4 Loss function

The proposed framework consists of three encoders, three feature modulation modules connected to them, and a pre-trained StyleGAN generator. The three encoders are pre-trained as described in Subsection 3.2. The three feature modulation modules are trained with five loss functions: the hairstyle constraint loss \(L_{hairstyle}\), background loss \(L_{bg}\), hair color loss \(L_{color}\), normalization loss \(L_{norm}\), and identity preservation loss \(L_{id}\).

3.4.1 Hairstyle constraint loss

To generate a new face with the desired hairstyle, the target hair shape is taken from the hair region of the hairstyle reference mask image \(I_{mask}\), while the hair region of the generated face \(I_{new}\) is obtained by a pre-trained face segmentation network [28]. \(D_{hair}()\) is the density distribution map that indicates the probability of each pixel belonging to the hair region. \(L_{hairstyle}\) measures the pixel-wise coincidence between \(I_{mask}\) and \(I_{new}\), as defined in Eq. (7):

$$\begin{aligned} \begin{aligned}L_{hairstyle} = || D_{hair}({I_{new}}) \times ( 1 - I_{mask})\\ - D_{hair}({I_{new}}) \times {I_{mask}}||_2\end{aligned} \end{aligned}$$
(7)

where \({I_{mask}}\) denotes the hair region of the hairstyle reference mask image and \(D_{hair}({I_{new}})\) denotes the density distribution of the hair region in the generated face image. The loss is computed as an L2-norm; a low value of \(L_{hairstyle}\) means a small difference between the hair shapes of the reference and modified images.
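A possible implementation of Eq. (7), assuming the hair density map is the soft hair-class probability produced by the segmentation network and the reference mask is binary:

```python
import torch

def hairstyle_loss(hair_density_new: torch.Tensor, mask_ref: torch.Tensor):
    """Eq. (7), read literally: penalize hair probability outside the reference
    hair mask against the probability inside it.
    Both tensors have shape (batch, 1, H, W) with values in [0, 1]."""
    outside = hair_density_new * (1.0 - mask_ref)
    inside = hair_density_new * mask_ref
    return torch.norm(outside - inside, p=2)
```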

3.4.2 Background loss

The background is defined as the whole image minus the segmented hair region. The intersection of the backgrounds of the original face image and the generated face image is computed and denoted \(B_{ori}\) for the original image \(I_{ori}\) and \(B_{new}\) for the generated image \(I_{new}\), respectively. The background loss \(L_{bg}\) is designed to keep the regions irrelevant to the hair unchanged, as shown in Eq. (8):

$$\begin{aligned} L_{bg}=|| B_{ori} - B_{new}||_2 \end{aligned}$$
(8)

3.4.3 Hair color loss

The respective hair regions of the original face image and the generated face image are computed and denoted \(H_{ori}\) for the original image and \(H_{new}\) for the generated image, respectively. The hair color loss \(L_{color}\) is defined to measure the difference between the hair color of the reference image and that of the generated image by computing the pixel-wise L2-norm:

$$\begin{aligned} L_{color}=|| H_{ori} - H_{new}||_2 \end{aligned}$$
(9)

3.4.4 Normalization loss

Since the hair color is influenced by the fine level latent feature sub-vector, while the hairstyle-related geometric structure is mainly related to the coarse level sub-vector, we apply a normalization loss to restrict feature changes at the medium level of the latent space, as described in Eq. (10):

$$\begin{aligned} L_{norm}=||\varDelta w^{ori}_m - \varDelta w^{new}_m||_2 \end{aligned}$$
(10)

where \(\varDelta w^{ori}_m\) is the medium level latent feature sub-vector of the original face image, \(\varDelta w^{new}_m\) is the medium level latent feature sub-vector of the generated face image.

Fig. 8 Independent hair color transfer samples for male and female

Fig. 9 Independent hairstyle transfer samples for male and female

3.4.5 Identity preservation loss

To further preserve the original face features, we focus on the preservation of the identity features. By utilizing a pre-trained ArcFace [2] face recognition network, 50-dimensional identity feature vectors are extracted from both the original face and the generated face. The cosine similarity cos() is employed to measure the similarity between two identity features x and y. The identity preservation loss \(L_{id}\) is defined as follows:

$$\begin{aligned} L_{id}=1-cos\left( R(x),R(y)\right) \end{aligned}$$
(11)

where R() is the identity feature extraction function. In summary, the overall loss function is defined as follows:

$$\begin{aligned} L&= \lambda _{hairstyle}L_{hairstyle} +\lambda _{color}L_{color}\nonumber \\&\quad +\lambda _{id}L_{id} +\lambda _{bg}L_{bg}+\lambda _{norm}L_{norm} \end{aligned}$$
(12)

where \(\lambda _{hairstyle}\), \(\lambda _{color}\), \(\lambda _{id}\), \(\lambda _{bg}\), and \(\lambda _{norm}\) are hyperparameters. Combining these loss functions with carefully tuned hyperparameters allows us to strike a balance between preserving the facial features and achieving the desired degree of hair modification.
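Putting the pieces together, a sketch of the overall objective in Eq. (12) could look as follows. The masking and feature-extraction helpers (segmentation masks, identity embeddings) are assumed to be computed elsewhere, and the default weights are the values reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def total_loss(hair_density_new, mask_ref,   # for L_hairstyle, Eq. (7)
               bg_ori, bg_new,               # for L_bg, Eq. (8)
               hair_ori, hair_new,           # for L_color, Eq. (9)
               dw_m_ori, dw_m_new,           # for L_norm, Eq. (10)
               id_ori, id_new,               # for L_id, Eq. (11)
               lam=(1.0, 0.01, 0.4, 1.0, 0.8)):  # (hairstyle, color, id, bg, norm)
    l_hairstyle = torch.norm(
        hair_density_new * (1 - mask_ref) - hair_density_new * mask_ref, p=2)
    l_bg = torch.norm(bg_ori - bg_new, p=2)
    l_color = torch.norm(hair_ori - hair_new, p=2)
    l_norm = torch.norm(dw_m_ori - dw_m_new, p=2)
    l_id = 1 - F.cosine_similarity(id_ori, id_new, dim=-1).mean()
    w_hs, w_col, w_id, w_bg, w_nrm = lam
    return (w_hs * l_hairstyle + w_col * l_color + w_id * l_id
            + w_bg * l_bg + w_nrm * l_norm)
```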

4 Experiments and analysis

To validate the effectiveness of the proposed model, qualitative and quantitative experiments are conducted. Furthermore, we compare our hair design system against state-of-the-art (SOTA) approaches. Moreover, we perform comprehensive ablation experiments to verify the necessity and contribution of each loss function within the proposed framework.

4.1 Implementation details

We conduct all the experiments on the extended CelebAMask-HQ dataset introduced in Subsection 3.1. The proposed framework is trained on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. During training, the Ranger [25] optimizer is used with an initial learning rate of 0.1, which decays by a factor of 0.95 every 10,000 training iterations. The StyleGAN2 [10] generator pre-trained on the FFHQ dataset [9] is used as our generator. For the loss hyperparameters, we set \(\lambda _{hairstyle}=1\), \(\lambda _{color}=0.01\), \(\lambda _{id}=0.4\), \(\lambda _{bg}=1\), and \(\lambda _{norm}=0.8\). For the hyperparameters of the modulation modules, we set \(\lambda _c=0.7\), \(\lambda _m=0.9\), and \(\lambda _f= 0.2\).
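The schedule above might be set up as in the following sketch. Adam and StepLR stand in for the Ranger optimizer, whose exact import path depends on the third-party package used, so treat these names as assumptions.

```python
import torch

def build_optimizer(modulation_params):
    """Optimizer and schedule matching Section 4.1: initial learning rate 0.1,
    decayed by a factor of 0.95 every 10,000 iterations."""
    # Adam is used here only as a stand-in for the Ranger optimizer.
    opt = torch.optim.Adam(modulation_params, lr=0.1)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10_000, gamma=0.95)
    return opt, sched

# Typical usage inside the training loop:
#   loss.backward(); opt.step(); opt.zero_grad(); sched.step()
```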

Fig. 10 The effects of changing from long hair to short hair, and from short hair to long hair based on reference masks

4.2 Qualitative and quantitative evaluations

This research focuses on the accurate transfer of hair attributes while ensuring natural blending of the hair with the face and background. Therefore, we conduct a series of experiments to show the efficacy of the proposed model in modifying hair color and style. Hairstyle and color design results for males are given in the left half of Fig. 7, and results for females in the right half. Although the reference hair color and hairstyle images may differ from the original face (the top right image in each example) in gender and key facial attributes, the generated face image (the large image in each example) achieves accurate hair modification while maintaining stable identity information. Due to the hairstyle and face contour differences between the input face and the reference face, either a strict or an adaptive hairstyle transfer effect can be achieved by adjusting the hyperparameters of the loss functions. In the current implementation, stringent hair shape transfer is employed, which may sometimes result in a less natural blending effect, as shown in Fig. 1.

The proposed method can modify the hairstyle and hair color independently or simultaneously. Experimental results of changing only the hair color are given in Fig. 8: by training on the augmented hair dataset, the proposed method correctly migrates rare colors and produces natural hair texture. Experimental results of changing only the hairstyle are given in Fig. 9; the regenerated hair fits the face contour well and blends coherently with the background. Furthermore, we conduct experiments on modifying hairstyles with different hair lengths, as shown in Fig. 10.

Fig. 11 Hair length shortening results. The hair blends naturally with the background as indicated by the red dashed boxes

In this experiment, the mask encoder efficiently extracts shape features, and the coarse and medium level vectors are modified to constrain the modification scope. Consequently, when shortening the hair, the modified hair region blends well with the background, as indicated by the red dashed boxes in Fig. 11.

To quantitatively investigate whether the proposed method accurately learns the reference hairstyle, we compare the similarity between the hair regions of the output face image \(m_1\) and the input face image \(m_2\) using the intersection over union measure \(M_{IoU}\) defined in Eq. (13):

$$\begin{aligned} M_{IoU}=1-\frac{1}{N}\sum _{i=1}^{N}\Vert m_1-m_2\Vert _1 \end{aligned}$$
(13)

The pre-trained face parsing network BiSeNet [28] is employed to extract the hair mask region of a face image, and the pixel-wise L1-norm between the two hair mask regions is calculated. We randomly sample two face images and compute their hair masks using different approaches, as shown in Fig. 12. The corresponding \(M_{IoU}\) values of the manually labeled mask \(m_{Manu}\) and the BiSeNet allocated mask \(m_{Ref}\), each compared to the BiSeNet allocated output mask \(m_{Output}\), are listed in Table 1. We can see that the hair shape and the \(M_{IoU}\) value of the output image are quite close to those of the reference image when the same parsing network is used.
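A direct reading of Eq. (13) as code, assuming the two hair masks are binary tensors of identical shape and N is the number of pixels:

```python
import torch

def m_iou(m1: torch.Tensor, m2: torch.Tensor) -> float:
    """Eq. (13): one minus the mean pixel-wise L1 difference between two
    binary hair masks with values in {0, 1}."""
    assert m1.shape == m2.shape
    n = m1.numel()
    return 1.0 - (m1 - m2).abs().sum().item() / n
```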

Fig. 12 Hair segmentation results of the input face image and generated face image. The hair masks are allocated by manual extraction or BiSeNet segmentation network

Table 1 Similarity between the manually labeled reference mask \(m_{Manu}\) and the BiSeNet allocated reference mask \(m_{Ref}\) compared to the BiSeNet allocated output mask \(m_{Output}\) is calculated
Fig. 13 Visual comparison with MichiGAN [20], LOHO [18], HairCLIP [24], Barbershop [30], and StyleYourHair [11]. Our method demonstrates accurate hair transfer, stable preservation of the irrelevant features, and natural blending effects

Currently, the \(M_{IoU}\) value of the manually labeled hair mask and output hair mask is low. This is probably because the hair mask predicted by the network loses details such as bangs. In the future, a more accurate facial segmentation network will be considered to improve the above \(M_{IoU}\) value and achieve more precise hairstyle modifications.

4.3 Comparison with state-of-the-art methods

To evaluate the effectiveness of the proposed model, we compare it with state-of-the-art (SOTA) methods, as shown in Fig. 13. We use face images with different hairstyle and hair color inputs (rows 1 to 5) and face images with the same hairstyle and hair color input (row 6), covering different genders and skin colors. In the testing stage, we use the authors' official open-source code and the released pre-trained models for comparison. As the image inpainting algorithms of LOHO and MichiGAN are not released, we use the image inpainting tool Generative Fill [1] instead.

Experimental results are given in Fig. 13. The proposed method achieves accurate hair color transfer with different hairstyle and hair color references. Through hierarchical feature modulation, the resulting hair blends naturally with the background while irrelevant attributes remain untouched. MichiGAN copies the hair shape from the reference image and renders it with the reference color; irrelevant areas are effectively preserved, but unnatural background blending occurs. LOHO reconstructs hair by optimizing latent vectors, and there is an obvious artifact around the face contour. The faces generated by HairCLIP have good texture detail; however, the hairstyle deviates from the reference shape due to its coarse control. Barbershop can generate hair similar to the reference style and color, but the hair does not align well with the face. StyleYourHair can modify the hairstyle according to the shape reference image but fails to incorporate a user-specified hair color from another image.

The results of the quantitative comparison with other methods are presented in Table 2. We choose Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Fréchet Inception Distance (FID), and Identity Similarity (IDS) computed by CurricularFace [5] as metrics to evaluate feature preservation and image quality. As in other existing algorithms, PSNR and SSIM are calculated on the face region, while IDS and FID are calculated on the whole image. When only the hair color is modified, the proposed method outperforms the others on the metrics shown in Table 2; when both the hairstyle and color are modified, the proposed method achieves the best results under the SSIM and FID metrics. We also compare the efficiency of the methods in Table 3, and our approach transfers hair attributes faster than the SOTA methods.

Table 2 Quantitative comparison regarding the preservation of face features and image quality
Table 3 Comparison of the time to generate a face image with desired hairstyle and hair color
Fig. 14 Visual ablation analysis of the proposed method

Table 4 Quantitative evaluation of ablation analysis of our method

4.4 Ablation analysis

To verify the effectiveness of the proposed network structure and loss functions, we alternately ablate each key component, as shown in Fig. 14. Removing the background loss \(L_{bg}\) (third column) not only changes the background area but also causes a severe shift of the overall color tone. Removing the identity loss \(L_{id}\) (fourth column) leads to a slight change in the facial features, reflected by the mouth change in this sample. When the normalization loss \(L_{norm}\) (fifth column) is removed, unnatural artifacts appear in the generated face image. By comparing the results with (second column) and without the ECA module (sixth column), we can see that the texture details are more natural when the ECA module is incorporated. Furthermore, the effect of each loss function on maintaining facial identity and visual quality is reported in Table 4. In conclusion, we can infer that without the background loss \(L_{bg}\) or the normalization loss \(L_{norm}\), both the image quality and identity recognition are reduced.

5 Conclusion

In this paper, we propose a hierarchical multi-feature guided hair design system. Accurate and flexible hairstyle and color changes are achieved through style and color reference images. Different losses are incorporated to constrain the modification to the hair region, preventing changes to important features such as face identity while keeping natural fusion with the background. By carefully adjusting the weights of the loss functions, the system can either generate a hairstyle and color closer to the reference image or better fit the input face. In the future, more accurate hairstyle transfer can be realized by optimizing the facial segmentation network.