1 Introduction

Image inpainting is the task of generating visually realistic content in the missing regions of corrupted input images. It has a wide range of applications; for example, it allows removing unwanted objects from an image or synthesizing content in occluded areas. Face inpainting is an interesting and challenging branch. Its main challenge lies in the fact that the face is a region with strong structural and semantic features, while the texture of facial features is continuous and unobtrusive. These unique properties require face inpainting not only to recover reasonable semantic information but also to ensure the continuity of texture and structure.

Recently, along with the development of deep learning, significant progress has been made in image and face inpainting [12, 17, 21, 34, 35, 39]. Although these prior works can generate plausible results for the missing region, they cannot ensure high fidelity. It is widely assumed that, once the missing region consists of a large continuous area or involves key information, it is almost impossible to guarantee that the repaired results exactly match the ground truth.

Nevertheless, we can still improve the authenticity of the inpainting under reasonable assumptions. Compared with intricate natural scenes, it is easier to make reasonable assumptions about human faces. The human face involves specific attributes including eyes, nose, mouth, cheeks, beard, and other features. Taking the mouth as an example, it can be big or small in size, and open or closed in posture. If the attributes of the occluded face area are determined in advance according to the ground truth, for example that the face has a beard or the mouth is open, the generated result is likely to be more similar to the ground truth. Unfortunately, this is a paradox, since the main challenge of image inpainting is precisely the lack of information. Thus, the basic assumption of this paper is that we can obtain face attributes in advance and that these attributes are in line with the ground truth.

Based on this observation, our main goal is to ensure that the repaired results meet the basic semantic content and are visually close to the ground truth under the guidance of face attributes during face inpainting (as shown in Fig. 1). Wearing masks in public has become common, which provides a practical application scenario for mask removal. Thus, we propose a face inpainting method for mask removal accordingly. To simulate the actual scene of wearing a mask, we made synthetic facial occlusions that mimic real masks, so that 1) a large-scale irregular continuous area is occluded; and 2) the lower half of the face, including the nose, cheeks, mouth, and other facial details, is occluded. The main contributions of our paper can be summarized as:

  1.

    We propose a novel face inpainting method that removes face masks and realistically recovers the missing face region under the guidance of face attributes.

  2.

    We design a novel dual pipeline network structure with an attention mechanism to effectively exploit both known and unknown regions’ information to obtain semantic results.

  3.

    Experimental results compared with state-of-the-art methods demonstrate that our method achieves competitive results.

Fig. 1

Face image inpainting results by our approach. It takes mask-wearing faces as input and facial attributes as guidance to obtain the mask removal result

2 Related work

2.1 Automatic image inpainting

The approaches related to image inpainting can be roughly classified into two categories: patch-based or diffusion-based methods and deep learning methods.

Patch-based methods [1, 4, 8, 9, 14] fill the missing region patch by patch by searching for and extending pixels in the undamaged area of the image. These algorithms work well on images whose unobstructed areas contain content suitable for filling the obstructed regions, which cannot be guaranteed in most real scenes. Diffusion-based methods [14, 16, 23] fill the missing region by propagating neighborhood content into the holes; they are robust for small holes but fail to complete complex images with semantic structures.

Deep learning methods have been proposed for image inpainting that directly generate the missing regions. Pathak et al. [18] proposed a context encoder to extract features and then decode them to reconstruct the output. Iizuka et al. [7] proposed global and local discriminators to handle high-resolution images and generate more realistic results. Yu et al. [34] proposed an end-to-end image inpainting model with contextual attention to borrow information from the surroundings. However, these methods can only handle rectangular masks. To handle irregular masks, Liu et al. [12] proposed partial convolution, which only utilizes valid pixels and updates the masks and weights layer by layer.

However, partial convolution [12] heuristically classifies all spatial locations as either valid or invalid, and the update rule is not learnable. Therefore, gated convolution [35] was proposed to learn a dynamic feature selection mechanism for each spatial location and obtains competitive results. After that, Sagong et al. [22] proposed PEPSI, which improves on the gated-convolution architecture, reducing the number of convolution operations almost by half while exhibiting superior performance to other models in terms of testing time and qualitative scores.

Besides, Yi et al. [32] propose a Contextual Residual Aggregation (CRA) mechanism that produces high-frequency residuals for the missing content by aggregating weighted residuals from contextual patches, so that only a low-resolution prediction is required from the network. CRA enables large images (up to 8K) with considerable hole sizes to be inpainted with limited memory and computing resources, which is intractable for prior methods. Its disadvantage is that it cannot generate multiple styles of repaired images. Wu et al. [28] propose a new end-to-end, coarse-to-fine generative model that combines a local binary pattern (LBP) learning network with an actual inpainting network to tackle various unpleasant artifacts, especially in boundary and highly textured regions. However, the design of these two sub-networks is crucial and can lead to vastly different results. Zhao et al. [38] present the Unsupervised Cross-space Translation Generative Adversarial Network (UCTGAN). The network realizes a one-to-one mapping between the instance image space and the conditional completion image space, which significantly reduces the possibility of mode collapse and improves the diversity of restored images. However, it cannot inpaint images with free-form masks.

Recently, Qin et al. [19] propose a method based on weighted facial similarity for face inpainting with large missing regions. Han et al. [5] propose a face image inpainting method with evolutionary generators to overcome the gradient vanishing problem. Yang et al. [31] use a paired discriminator to inpaint damaged face images and maintain stronger semantic consistency. These methods achieve good image inpainting results by optimizing the structure, but they do not exploit the original semantics of the human face well.

2.2 Guided image inpainting

In general image inpainting tasks, the input consists of a corrupted image and a mask that indicates the missing pixels. Blind image inpainting, such as [26], only takes the corrupted image as input and adopts a mask prediction network to estimate the mask. However, more inpainting methods adopt additional inputs beyond the image and mask to improve the results. Yu et al. [35] take user sketches as input to extend the inpainting network into a user-guided system. Ren et al. [21] take the incomplete image structure as input, which is first reconstructed as global structure information for texture generation. Nazeri et al. [17] generate edges for the missing regions of the image and use the hallucinated edges as a prior to guide inpainting. Xiong et al. [30] learn to predict the foreground contour first and then repair the missing region using the predicted contour as guidance. However, these works mainly focus on visual information as additional input. To acquire more semantically accurate inpainted images, Zhang et al. [36] propose a novel model that incorporates a descriptive text as part of the input. In this paper, we use face attributes as an additional input to guide the inpainting results.

2.3 Face generation with attributes

Face generation was a challenging problem until the rapid advancement of the generative adversarial network (GAN). The original GAN work [3] introduced a novel framework containing two feed-forward networks, a generator G and a discriminator D. At present, countless GAN variants [11, 15, 20, 33, 40] have been proposed for many computer vision applications, and face generation has accordingly seen great progress. To generate faces of different ages from an arbitrary query face without knowing its true age, CAAE [37] was proposed, which used age as an attribute during training. Xiao et al. [29] proposed ELEGANT, which uses face attributes to transfer facial features between two face images. Nasir et al. [17] used textual descriptions of the face for face generation. However, in the above studies, most input images are complete, so their semantic information can be utilized to guide image generation. In inpainting problems, by contrast, the image is corrupted and the features are no longer complete. Nevertheless, we can still draw inspiration from the above works, because both tasks need to use face attributes as part of the input to generate human faces. Hence, how to use attributes and combine them with latent features is the key problem in face inpainting, which will be discussed in Section 3.1.

3 Approach

The detailed structure of the proposed network is shown in Fig. 2. Suppose we have a complete image Ig, which serves as ground truth. Ig is degraded by a masked region M and becomes Im. Im is the masked image that needs to be repaired, and its complement is defined as Ic. Besides, some face attributes of Ig are considered as part of the input, named Iattr.

Fig. 2

Overview of our framework. The upper one is the reconstructive path, used only in training. The lower one is the generative path, used in both training and testing. Face attributes are fed in to guide the inpainting results during the whole process

3.1 Dual pipeline network structure

The framework consists of two paths: a reconstructive path and a generative path. The upper path is the reconstructive path, which uses information from the whole image including Ic and Im. Ic is used to infer the whole image's latent information and is available only in training. The lower path is the generative path. Like other single-path inpainting networks, it only uses Im as input to obtain latent information and is used in both training and testing. The reconstructive path and the generative path share identical weights.

The input and output face images are 256 × 256 RGB images. A 5-layer neural network with residual modules is adopted as the encoder, aiming to extract the semantic information about the masked area from Im and Ic.

$$ Z_{c/m} = Enc(I_{c}/I_{m}) $$
(1)

where Enc denotes the image encoder. In the reconstructive path, Ic is fed into the encoder. As Ic contains the information in the masked regions, Zc encodes the effective area, representing the true characteristics of the occluded region. In the generative path, Zm is the output of the encoder and mainly carries information about the unmasked region of Im.
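For concreteness, a minimal PyTorch sketch of a 5-layer residual encoder of this kind is given below; the channel widths, strides, and the ResidualBlock design are our own illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ImageEncoder(nn.Module):
    """Five downsampling stages, each followed by a residual module.

    A 256x256 RGB input is reduced to an 8x8 feature map; the channel
    widths below are illustrative assumptions.
    """
    def __init__(self, in_ch=3, widths=(32, 64, 128, 256, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(prev, w, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True),
                       ResidualBlock(w)]
            prev = w
        self.net = nn.Sequential(*layers)

    def forward(self, img):           # img: (B, 3, 256, 256)
        return self.net(img)          # -> (B, 256, 8, 8)
```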

As we use face attributes as guidance, we select 7 representative attributes from CelebA [13]: big nose, chubby face, makeup face, male, open mouth, no beard, and young face. We encode these face attributes as a vector in which 1 means true and -1 means false. The details are given in Section 4.1. For the input of face attributes, we also use an encoder to transform the labels into a feature tensor.

$$ Z_{attr} = Enc_{attr}(I_{attr}) $$
(2)

where Encattr is the face attribute encoder and Iattr is the attribute vector. Zattr is the feature tensor encoding these characteristics. In practice, we ensure that Zattr has the same dimensions as Zm and Zc.
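A possible realization of Encattr is a small MLP that lifts the 7-dimensional ±1 vector to a tensor with the same shape as Zm and Zc; the hidden size and latent shape below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Maps the 7-dim attribute vector (entries in {-1, 1}) to a feature
    tensor with the same (assumed) shape as the image latents Z_m / Z_c."""
    def __init__(self, n_attrs=7, latent_shape=(256, 8, 8)):
        super().__init__()
        self.latent_shape = latent_shape
        out_dim = latent_shape[0] * latent_shape[1] * latent_shape[2]
        self.mlp = nn.Sequential(
            nn.Linear(n_attrs, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, out_dim),
        )

    def forward(self, attrs):                     # attrs: (B, 7), values in {-1, 1}
        return self.mlp(attrs).view(-1, *self.latent_shape)

# Example attribute vector, following the order listed in Section 4.1:
# z_attr = AttributeEncoder()(torch.tensor([[1., -1., -1., 1., 1., -1., 1.]]))
```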

After extracting the image’s high-level features of the input face in Ic and Im, along with face attributes Zattr , these tensors are concatenated and fed into a distribution network.

$$ Z = [Z_{attr} + Z_{m}, Z_{c}] $$
(3)
$$ P(Z) = \mathcal{N}(Z) $$
(4)

where $\mathcal{N}$ denotes the sampler and P(Z) is the distribution of Z. Considering that Zc carries specific semantic features and Zattr carries abstract face attributes of the masked area, while Zm only contains information about the unmasked area and must infer and generate the masked face, we prefer adding Zattr to Zm (rather than concatenating them) in both training and testing, to guide the generative path toward semantically plausible face images. As Zc is only available in training, we utilize it to obtain a prior distribution of the effective area and assist the generative path in learning a distribution that is closer to the real one. The effectiveness of this process is further discussed in Section 4.4.
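One way to read Eqs. (3)–(4) is as a small head that fuses the attribute-guided generative features with the reconstructive features and samples a latent via the reparameterization trick. The sketch below is our own simplification; in particular, how the missing Zc is handled at test time is an assumption.

```python
import torch
import torch.nn as nn

class DistributionHead(nn.Module):
    """Fuses Z = [Z_attr + Z_m, Z_c] and samples from the predicted
    Gaussian P(Z) (our reading of Eqs. (3)-(4); sizes are assumptions)."""
    def __init__(self, in_ch=512, z_ch=256):
        super().__init__()
        self.to_mu = nn.Conv2d(in_ch, z_ch, 1)
        self.to_logvar = nn.Conv2d(in_ch, z_ch, 1)

    def forward(self, z_m, z_attr, z_c=None):
        guided = z_m + z_attr                          # attribute-guided generative features
        if z_c is not None:                            # training: reconstructive features available
            fused = torch.cat([guided, z_c], dim=1)    # Z = [Z_attr + Z_m, Z_c]
        else:                                          # testing: Z_c is unavailable; reusing the
            fused = torch.cat([guided, guided], dim=1) # guided features is our own placeholder
        mu, logvar = self.to_mu(fused), self.to_logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample
        return z, mu, logvar
```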

Inspired by [39], we feed the low-level image features from the encoder layers to the decoder layers of the generator through a highway path based on short+long term attention. This path is also shared with the reconstructive path at the same layer. The short+long term attention preserves fine features from the encoder and obtains semantic features in the decoder.

3.2 Discriminator on face images

Following the principle of GAN, the discriminator first requires the generator to produce more realistic faces. In both the reconstructive and generative paths, the discriminators Dr and Dg are based on LSGAN [15], which performed better than the original GAN in our experiments.

$$ \mathcal{L}_{ad}^{r} = \Vert D_{r}(I_{rec}) - D_{r}(I_{g}) \Vert_{2} $$
(5)

where $\mathcal{L}_{ad}^{r}$ is the adversarial loss in the reconstructive path and Irec is the reconstructed output image.

$$ \mathcal{L}_{ad}^{g} = [ D_{g}(I_{gen}) - 1]^{2} $$
(6)

where $\mathcal{L}_{ad}^{g}$ is the adversarial loss in the generative path and Igen is the generated image.

Since face attributes are used as part of the input, the output face image should also correspond to the expected attributes. To distinguish the attributes of the output face images, a pre-trained face detection model named Dattr was built using ResNet [25] as the backbone. Dattr is used to evaluate both reconstructed and generated face images and gives a confidence score for each face attribute.

$$ \mathcal{L}_{attr}^{r} = \sum\limits_{i = 0}^{6} \vert D_{attr}(I_{rec}^{(i)}) - I_{attr}^{(i)}\vert $$
(7)
$$ \mathcal{L}_{attr}^{g} = \sum\limits_{j = 0}^{6} \vert D_{attr}(I_{gen}^{(j)}) - I_{attr}^{(j)}\vert $$
(8)

where $\mathcal{L}_{attr}^{r}$ is the attribute loss in the reconstructive path and $\mathcal{L}_{attr}^{g}$ is the attribute loss in the generative path. As we have 7 face attributes as input, i and j index the different attributes. Besides, $D_{attr}(I_{rec}^{(i)}) \in [-1,1]$, $D_{attr}(I_{gen}^{(j)}) \in [-1,1]$, and $I_{attr}^{(i/j)} \in \{-1,1\}$. If the score of an attribute is 0, the attribute is difficult to distinguish and the inpainting result is not consistent with the ground truth. As $\mathcal{L}_{attr}^{r}$ or $\mathcal{L}_{attr}^{g}$ decreases, the attributes of the reconstructed or generated face image become more consistent with the input attributes. The specific results of this discriminator are shown in Section 4.4.
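A hedged sketch of how the adversarial terms of Eqs. (5)–(6) and the attribute terms of Eqs. (7)–(8) could be computed; `d_r`, `d_g`, and `d_attr` stand for the discriminators Dr, Dg, and Dattr, assumed to return a real-valued score and a 7-dimensional prediction in [-1, 1], respectively.

```python
import torch

def adversarial_losses(d_r, d_g, i_rec, i_gen, i_gt):
    """Generator-side adversarial terms following Eqs. (5)-(6)."""
    # Eq. (5): L2 distance between discriminator responses on the
    # reconstructed image and the ground truth.
    l_ad_r = torch.norm((d_r(i_rec) - d_r(i_gt)).flatten(1), dim=1).mean()
    # Eq. (6): LSGAN-style loss pushing D_g(I_gen) towards 1.
    l_ad_g = ((d_g(i_gen) - 1.0) ** 2).mean()
    return l_ad_r, l_ad_g

def attribute_losses(d_attr, i_rec, i_gen, attrs):
    """Attribute terms following Eqs. (7)-(8); attrs holds the 7
    ground-truth labels in {-1, 1} for each image in the batch."""
    l_attr_r = torch.abs(d_attr(i_rec) - attrs).sum(dim=1).mean()
    l_attr_g = torch.abs(d_attr(i_gen) - attrs).sum(dim=1).mean()
    return l_attr_r, l_attr_g
```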

3.3 Loss function

The proposed appearance matching loss is used to constrain the appearance fields. It measures whether the reconstructed or generated image matches the ground truth in structure and texture. It is computed as

$$ \mathcal{L}_{app}^{r} = \Vert I_{rec} - I_{g} \Vert_{1} $$
(9)
$$ \mathcal{L}_{app}^{g} = \Vert I_{gen} - I_{g} \Vert_{1} $$
(10)

Since we use the prior distribution of the known regions in Ic and the sub-distribution of the occluded area in Im, a KL divergence term is adopted in our network to regularize the sampling function against a fixed latent distribution, following [36]. In the reconstructive path, the posterior sampling function q is used, z is the latent vector, and $\mathcal{N}(0, 1)$ is the standard Gaussian. The KL loss is formulated as:

$$ \mathcal{L}_{KL}^{r} = -KL(q(z \vert I_{c}, P(Z)) \Vert \mathcal{N}(0, 1) ) $$
(11)

For the generative path, the conditional prior and likelihood p are used:

$$ \mathcal{L}_{KL}^{g} = -KL(q(z \vert I_{c}, P(Z)) \Vert p(z \vert I_{m} )) $$
(12)
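Assuming both distributions are modeled as diagonal Gaussians parameterized by a mean and log-variance (as in the sampling sketch above), the KL terms of Eqs. (11)–(12) have the standard closed forms sketched below; the sign convention in the loss follows the paper's equations.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over all latent dims (cf. Eq. (11))."""
    kl = 0.5 * (mu ** 2 + torch.exp(logvar) - logvar - 1.0)
    return kl.flatten(1).sum(dim=1).mean()

def kl_between_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, .) || N(mu_p, .) ) for diagonal Gaussians (cf. Eq. (12))."""
    var_q, var_p = torch.exp(logvar_q), torch.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.flatten(1).sum(dim=1).mean()
```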

Overall, the total loss function consists of four groups:

$$ \begin{array}{@{}rcl@{}} \mathcal{L} &=& \lambda_{app} (\mathcal{L}_{app}^{r} + \mathcal{L}_{app}^{g} ) + \lambda_{KL} (\mathcal{L}_{KL}^{r} + \mathcal{L}_{KL}^{g} )\\ &&+ \lambda_{ad} (\mathcal{L}_{ad}^{r} + \mathcal{L}_{ad}^{g} ) + \lambda_{attr} (\mathcal{L}_{attr}^{r} + \mathcal{L}_{attr}^{g} ) \end{array} $$
(13)

where λapp, λKL, λad, λattr are hyperparameters. In our experiments, we set λapp = 20, λKL = 20, λad = 1 and λattr = 1.
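For reference, a sketch of how the four loss groups could be combined with the weights stated above; the dictionary keys are hypothetical names for the individual terms.

```python
def total_loss(losses, lambda_app=20.0, lambda_kl=20.0, lambda_ad=1.0, lambda_attr=1.0):
    """Weighted sum of the four loss groups as in Eq. (13).

    `losses` maps hypothetical keys ("app_r", "app_g", "kl_r", "kl_g",
    "ad_r", "ad_g", "attr_r", "attr_g") to scalar tensors.
    """
    return (lambda_app * (losses["app_r"] + losses["app_g"])
            + lambda_kl * (losses["kl_r"] + losses["kl_g"])
            + lambda_ad * (losses["ad_r"] + losses["ad_g"])
            + lambda_attr * (losses["attr_r"] + losses["attr_g"]))
```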

4 Experiments

4.1 Implementation details

We evaluate our model on CelebA [13], which has annotations for 40 attributes and 5 landmark locations.

Although the face has multiple attributes, we only need to focus on the occluded area. Since our target is to remove the mask with the help of face attributes, we select 7 representative and generic attributes located mainly in the masked region, reducing the impact of properties unique to individuals: BigNose (big nose), Chubby (chubby face), Makeup (makeup face), Male (gender is male), MouthOpen (open mouth), NoBeard (no beard), and Young (young face). For these attributes, 1 means true and -1 means false, following the labels in CelebA. During training, we use a pre-trained face detection model to distinguish the attributes of the output face images, which scores each attribute between -1 and 1. If the score of an attribute is 0, the attribute is difficult to distinguish and the inpainting result is not consistent with the ground truth. We use this inference as the criterion for the discriminators.
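As an illustration, the following snippet builds the 7-dimensional ±1 vector from CelebA's attribute annotations; the mapping of the paper's short names to CelebA column names (e.g. Makeup to Heavy_Makeup, MouthOpen to Mouth_Slightly_Open) is our assumption.

```python
# Assumed mapping from the attribute names above to CelebA columns
# (list_attr_celeba.txt already stores labels as 1 = true, -1 = false).
SELECTED_ATTRS = ["Big_Nose", "Chubby", "Heavy_Makeup", "Male",
                  "Mouth_Slightly_Open", "No_Beard", "Young"]

def attrs_to_vector(annotation):
    """annotation: dict mapping CelebA attribute names to 1 / -1 for one image."""
    return [float(annotation[name]) for name in SELECTED_ATTRS]
```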

Besides, to our knowledge, there was no open-source dataset of faces wearing masks at the time of this submission. Hence, we made synthetic facial occlusions that mimic real masks. First, we use PRNet [2] to obtain face pose estimation and alignment. Second, we place the occlusion over the local key points covered by real face masks and thereby create a mask dataset, shown in Fig. 3. Each face corresponds to a unique mask; thus, we have the same number of masks as training and testing images.

Fig. 3

Our mask dataset. Since each face corresponds to a unique mask as input, the masks in the dataset have different shapes and sizes

We then randomly select 180,000 training images and 2,112 testing images from CelebA and obtain the corresponding attributes and mask images. Both face images and mask images are resized to 256 × 256; face images are RGB while mask images are grayscale.

Our proposed model is implemented in PyTorch. The network is trained on 256 × 256 images with a batch size of 8. We use the Adam optimizer [10] with a learning rate of $10^{-4}$.
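A minimal training-setup sketch matching the stated hyperparameters; `model` and `train_dataset` are placeholders for the inpainting network and the masked-CelebA dataset described above.

```python
import torch
from torch.utils.data import DataLoader

def make_training_setup(model, train_dataset):
    """Batch size 8 and Adam with learning rate 1e-4, as reported."""
    loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    return loader, optimizer
```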

4.2 Quantitative results

Similar to most inpainting works, we measure the quality of our results using the following metrics: 1) mean l1 loss; 2) peak signal-to-noise ratio (PSNR); 3) structural similarity index (SSIM) [27]. Although pixel-level plausibility cannot represent semantic plausibility, these metrics measure the distortion of the results. Besides, we also adopt the Fréchet Inception Distance (FID) [6] as one of our metrics. As FID calculates the Wasserstein-2 distance between two distributions, it can indicate the perceptual quality of the results. In this paper, we use the pre-trained Inception-V3 model [24] to extract features of real and inpainted images when calculating FID scores.
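For completeness, simple implementations of the pixel-level metrics are sketched below (FID is omitted, since it additionally requires Inception-V3 features); `structural_similarity` with the `channel_axis` argument assumes a recent scikit-image version.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_l1(pred, gt):
    """Mean absolute error between two HxWx3 images scaled to [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim(pred, gt):
    """Structural similarity index for RGB images."""
    return structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
```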

The results on CelebA are reported in Table 1. Our approach achieves better performance than the other methods on most metrics, which suggests that it produces inpainting results of higher quality. Although PIC [39] obtains a slightly better FID score than ours, it is worth noting that PIC generates multiple diverse results by sampling; for a fair comparison, we follow the protocol in [39] to obtain quantitative measures. Visualized results are discussed in Section 4.3.

Table 1 Quantitative comparison with state-of-the-arts. ↓ means lower is better and ↑ means higher is better

4.3 Qualitative results

Next, we visually compare our model with previous state-of-the-art methods [17, 35, 39]. Fig. 4 shows a sample of automatic inpainting results. Specifically, we use the pre-trained models of GC [35] and EC [17] because they did similar work on CelebA. As for PIC [39], we train and fine-tune their model on our mask dataset for a fair comparison.

Fig. 4

The qualitative comparisons with existing state-of-the-art methods on CelebA. Zoom in for a better view

It can be seen that the results of GC and EC suffer from artifacts because of the large and continuous mask as well as facial complexity. Although EC can infer edges of missing regions, our mask is a large-scale continuous occlusion, so it is difficult for EC to infer such information and structures for the masked face. Compared with these two methods, PIC can recover reasonable facial features. However, masked faces lack information about the mouth, nose, cheek, beard, and other details, so it is almost impossible for PIC to guarantee the correspondence of attributes between the generated results and the ground truth.

In contrast, since our method uses face attributes as part of the input to guide face inpainting, we can generate faces better aligned with the real attributes while ensuring semantic and structural rationality.

4.4 Ablation studies

We further perform experiments to study the effect of the components of our model, especially the face attributes. Using the same pre-trained model, we change the test setup so that the face attributes are abandoned. Besides, note that in (3) we can alter the feature tensor to Zswap = [Zm, Zattr + Zc]; that is, the face attribute information is combined with the masked-region features in the reconstructive path and no prior guidance is used in the generative process. Although some features are shared, this weakens the effect of face attributes. In the same way, we also test this variant with and without face attributes. The evaluation is shown in Table 2 and the results are shown in Fig. 5.

Table 2 Quantitative comparison in ablation studies. ↓ means lower is better and ↑ means higher is better. Besides, w/o means without face attributes and w/ means with face attributes
Fig. 5

The qualitative comparisons in ablation studies. (From left to right) Ground truth, input masked face, results of ours, results of ours without attributes, results of Zswap with attributes, and results of Zswap without attributes. Zoom in for a better view

Our current method leads to significantly better results. It can be seen that, without face attributes at test time, both the semantic and photo-realistic quality are damaged. Although Zswap can achieve good visual quality, its characteristics do not match the ground truth well. In short, our current method obtains competitive results not only in visual quality but also on the evaluation metrics.

As mentioned in (7) and (8), we show the face attribute evaluation of the output face images, using the results in Fig. 5 as an example. The evaluation is shown in Table 3. Comparing with the visualized results, it can be seen that some attributes match their labels well, while some cannot be well classified by the discriminator because of ambiguous labels and classifier error, which inevitably affects the inpainting results. How to improve a specific attribute will be discussed in future work.

Table 3 Face attributes evaluation. The attributes include BigNose, Chubby, Makeup, Male, MouthOpen, NoBeard and Young. Among them, value close to 1 means true while -1 means false

5 Conclusion

In this paper, we proposed a novel dual pipeline network for mask removal. From a practical perspective, we focused on faces wearing masks and used face attributes as input to guide the inpainting process. We showed that our network architecture and loss functions can make use of face attribute information and remove masks well. Besides, we built a mask dataset simulating the real occlusion effect of a mask. Experiments showed that our model obtains competitive results compared with several state-of-the-art methods.