Multi-scale progressive blind face deblurring

Blind face deblurring aims to recover a sharper face from its unknown degraded version (e.g., different motion blur and noise). However, most previous works rely on facial priors extracted from low-quality inputs; these priors are themselves degraded, which generally leads to unrealistic deblurring results. In this paper, we propose a multi-scale progressive face-deblurring generative adversarial network (MPFD-GAN) that requires no facial priors and generates more realistic multi-scale deblurring results in a single feed-forward pass. Specifically, MPFD-GAN includes two core modules: the feature retention module (FRM) and the texture reconstruction module (TRM). The former captures non-local similar features by taking full advantage of different receptive fields, which helps the network recover the complete structure. The latter adopts a supervised attention mechanism that uses the recovered low-scale face to refine incoming features at every scale before propagating them further. Moreover, TRM extracts high-frequency texture information from the recovered low-scale face with the Laplace operator, which guides subsequent steps to progressively recover faithful face texture details. Experimental results on the CelebA, UTKFace and CelebA-HQ datasets demonstrate the effectiveness of the proposed network, which achieves better accuracy and visual quality than state-of-the-art methods.


Introduction
Face deblurring is the task of recovering a sharp face with both edge structures and realistic details from a low-quality counterpart suffering from unknown degradation, such as different motion blur [35,45] and noise [62]. The degradation process is generally defined as:

$$I_B = K * I_S + N$$

where $I_B$, $I_S$, $K$, and $N$ represent the blurry image, sharp latent image, blur kernel and noise, respectively; $*$ represents the convolution. Given $I_B$, the objective in face deblurring is to estimate the underlying sharp face image $I_S$. Deblurring techniques can be divided into non-blind and blind methods according to whether the blur kernel $K$ is known [31]. The latter is more challenging than the former, because it is a typical ill-posed problem with infinitely many feasible solutions [57]. Therefore, most researchers are currently working on blind deblurring techniques [7,13,53,57,61], and this paper is no exception. In addition, solving the deblurring problem accurately and quickly is important for computer vision and image processing and can produce tremendous commercial value [5].
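As an illustration of the degradation model above, the following minimal sketch synthesizes a blurry image from a sharp one by convolving it with a kernel and adding white Gaussian noise. It is not the data pipeline used in this paper: the function names `convolve2d_same` and `degrade`, the zero padding at the borders, and the noise parameter `noise_sigma` are assumptions made for the example.

```python
import random


def convolve2d_same(image, kernel):
    """Zero-padded 'same' 2-D convolution on nested lists (hypothetical helper)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # True convolution: the kernel is flipped relative to the image.
                    yy, xx = y + ph - i, x + pw - j
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def degrade(sharp, kernel, noise_sigma=0.0, seed=0):
    """Blurry image I_B = K * I_S + N, with N drawn as white Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    blurred = convolve2d_same(sharp, kernel)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in blurred]


# Usage: blur a single bright pixel with a 3 x 3 box kernel (no noise).
sharp = [[0.0] * 5 for _ in range(5)]
sharp[2][2] = 1.0
box = [[1.0 / 9.0] * 3 for _ in range(3)]
blurry = degrade(sharp, box, noise_sigma=0.0)
```

In the blind setting only the output of `degrade` is observed; the kernel and the noise are unknown, which is what makes the inverse problem ill-posed.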
Deep learning has brought significant advances for general image deblurring tasks [7,8,31,32,59,61]. There are three main research trends [17] for this task: CNN-based methods, GAN-based methods, and prior-guided methods. SPARNet [4] introduced a facial attention unit and a spatial attention mechanism based on convolutional neural networks (CNN) to generate high-quality outputs. Kupyn et al. [31,32] proposed a deblurring method based on the generative adversarial network (GAN) and demonstrated the potential of GANs for deblurring tasks. Inspired by the benefits of GANs, some progressive deblurring networks have also achieved great success in single image deblurring [7,8,61]. However, these multi-stage progressive networks suffer from excessive network size and depth and have difficulty maintaining the balance between spatial details and high-level background information [11]. Although the above methods achieve good performance for image deblurring, the particularity of face images means that blind face deblurring faces the following challenges [17,21]: (1) how to generate fine and realistic facial details; (2) how to achieve a good balance between visual quality and fidelity. Researchers have generally proposed two prior-guided schemes to overcome these challenges. The first scheme uses face-specific priors for face deblurring, such as sparsity [43,44], patch similarity [47], face landmarks [2,6], face semantic labels [45,57], and face component heat maps [60], and has shown the significance of these priors in face restoration. However, most of these priors are estimated from low-quality inputs in real-world scenarios [28-30] and thus inevitably suffer from degradation [49]. Although the above priors can guide facial recovery, they contain limited texture features for recovering facial details (e.g., hair texture, tooth contours, facial wrinkles) [56].
The second scheme builds a reconstruction-oriented dictionary containing rich high-quality face priors for face reconstruction, leading to better reconstruction results [34,52]. However, due to the limited dictionary capacity, these methods lose some of the richness and diversity of the reconstructed facial details [49]. Thus, restoring richer and more faithful facial texture details with reduced reliance on priors becomes a new challenge for blind face deblurring [56].
In this paper, we propose a multi-scale progressive face-deblurring generative adversarial network (MPFD-GAN) to recover sharper faces without requiring extra inputs (i.e., face priors, facial component dictionaries). Its generator includes three parts: the encoding process, the center process, and the decoding process (pyramid reconstruction process). In the first part, we filter out noisy information and extract abundant contextual features and textural details to provide fine-grained features for subsequent steps. In the center part, we design a feature retention module (FRM). It captures broad contextual information and richer receptive field information through multiple dilated convolutions with different dilation rates. This enhances the features and generates high-resolution features with rich spatial details. Moreover, FRM adds channel attention between the dilated convolutions to avoid artifacts caused by the fusion of multiple receptive field information. In the last part, we propose the texture reconstruction module (TRM) and plug it into all pyramid levels to progressively replenish face texture details. It consists of supervised attention and facial feature-guided reconstruction. The former computes attention maps from the low-scale face recovered in the previous step, under the guidance of the ground truth, and then uses these maps to reweight the input features and obtain fine features. The latter extracts high-frequency texture information from the recovered low-scale face with the Laplace operator, which guides subsequent steps to recover finer face texture details.
Experimental results on three publicly available datasets [37,38,65] illustrate that the proposed method can recover high-quality, detail-rich face images. The main contributions of this work can be summarized as follows:
• We propose an effective blind face deblurring network called MPFD-GAN, which can recover multi-scale sharp faces from blurry faces without requiring extra inputs (i.e., face priors, facial component dictionaries).
• We design a feature retention module (FRM), which captures richer receptive field information to recover the complete structure from a blurry image, and propose the texture reconstruction module (TRM), which generates texture guidance with rich facial details to help reconstruct facial texture details.
• Extensive experiments demonstrate that our method achieves better visual quality and quantitative metrics than the state-of-the-art techniques on the CelebA [38], UTKFace [65] and CelebA-HQ [37] datasets.

Image deblurring
Earlier blind deblurring methods estimate the blur kernel and then deconvolve the blurry image with it to obtain a sharp image [41,54,55]. Although these traditional methods have a certain deblurring effect, processing each image takes a lot of time [12].
With the development of deep learning, more capable algorithms have been applied to image deblurring tasks [7,8,32,59,61]. Kupyn et al. [31] proposed a GAN-based end-to-end image deblurring method named DeblurGAN. It uses PatchGAN [20] as the discriminator and optimizes the network using perceptual loss [25] and adversarial loss. However, face images recovered in this way show checkerboard artifacts and a poor deblurring effect (Fig. 1b). Subsequently, they proposed a new deblurring method named DeblurGAN-v2 [32] based on C-GAN [39]. DeblurGAN-v2 introduces the feature pyramid network as the core module to accelerate deblurring. However, its recovered face images look quite inharmonious (Fig. 1c), and its evaluation metrics in our experiments are also poor (Table 2). Ye et al. [59] proposed a scale-iterative upscaling network to recover sharp images. However, this method failed to achieve better visual quality on face datasets (Fig. 1e), and the iterative calculation requires a large amount of memory. Moreover, Zamir et al. [61] proposed a multi-stage network architecture to solve the complex image restoration problem. In this method, they first learn context-dependent features using an encoder-decoder architecture and then fuse them with local information extracted by a high-resolution branch with an attention mechanism. Horizontal connections are also added into the feature processing modules at each stage to avoid information loss. However, this approach incurs a high computational cost and is difficult to deploy widely in the real world [11]. Moreover, the recovered images are relatively coarse, losing much information in key areas such as the mouth (Fig. 1g), and they do not look natural and harmonious. Recently, Cho et al. [7] revisited the coarse-to-fine image restoration scheme and proposed a multiple-input multiple-output deblurring structure. This structure uses asymmetric feature fusion to fuse feature information at different scales. However, this method cannot restore the complete structure and faithful details (e.g., lips) when facing heavily degraded images (Fig. 1h).

Face deblurring
Existing deep learning-based blind deblurring methods can deal well with blurry images in the real world [59]. However, no single existing model architecture solves the deblurring problem in all settings. In face deblurring tasks, it is usually necessary to use face-specific priors to guide the recovery of face images [30,45,67]. These priors are divided into two main categories: geometric priors (e.g., face semantic labels) and reference priors [49]. Recently, Yasarla et al. [58] proposed using facial semantic labels to achieve better performance in face deblurring tasks. This method is divided into two stages: (1) the first stage feeds the blurry face images into a segmentation network to obtain different semantic labels for faces; (2) the second stage feeds the blurry face images and the semantic labels into a multi-stream semantic network, which processes the regions belonging to each semantic category independently and learns to combine the information from different regions into the final deblurring result. However, these semantic labels are estimated from blurred face images that suffer from severe degradation, so they are inevitably degraded in realistic scenes. The method mainly focuses on geometric constraints and does not consider how to recover critical areas of the face [49]. Its recovery results are still blurry in critical areas such as the hair (Fig. 1f). In contrast, our proposed method does not rely on estimating geometric priors from degraded blurry face images.

Methodology
Our goal is to recover high-quality faces with richer authentic details through a multi-scale progressive deblurring method (MPFD-GAN) without requiring extra inputs (i.e., face priors, facial component dictionaries). Figure 2 illustrates the overall structure of our network, which contains two key modules: FRM and TRM. In this section, we first detail the critical modules of MPFD-GAN and then describe its loss functions.

Fig. 2 (caption fragment): ... to produce the multi-scale restoration results $I_R$ and $I_R^i$. Local adversarial loss is adopted across the face region (labeled in green), while the remaining losses are adopted across the entire image (labeled in purple).

Feature retention module (FRM)
To retain broad contextual information without degrading spatial information, we design a feature retention module (FRM) between the encoder and the decoder (see the green block in Fig. 2). It consists of dilated convolution blocks with dilation rates of 1, 2, 4 and 8, respectively, and channel attention blocks (CAB) [16] (see Fig. 3).
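To make the effect of the dilation rates concrete, the following sketch computes the receptive field of a stack of stride-1 dilated convolutions. The 3 × 3 kernel size and the purely sequential stacking are assumptions made for this example; the function name `stacked_receptive_field` is hypothetical.

```python
def stacked_receptive_field(dilation_rates, kernel_size=3):
    """Receptive field (along one axis) of sequential stride-1 dilated convolutions."""
    rf = 1
    for d in dilation_rates:
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf


# Dilation rates used in the FRM.
rf_frm = stacked_receptive_field([1, 2, 4, 8])
# Four ordinary (dilation 1) convolutions for comparison.
rf_plain = stacked_receptive_field([1, 1, 1, 1])
```

Under these assumptions the dilated stack covers 31 pixels per axis, against 9 for four ordinary convolutions, without any downsampling of the feature map.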
Unlike other dilated convolution blocks [66,68], we add a CAB between every two successive dilated convolution blocks to avoid artifacts caused by the fusion of multiple receptive field information. This effect is illustrated in Fig. 4, where FRM with CAB (third row) produces sharper feature maps than FRM without CAB (second row). Moreover, we add a skip connection to ensure that the proposed model makes full use of the information from the shallow features. In this way, our network obtains richer receptive fields without changing the feature size and generates high-resolution features with rich spatial information. In the formal description of the FRM, $F_{in}$ and $F_{out}$ denote the input and output features of the FRM. $Conv_{dr_i}$ and $F_{dr_i}$ represent the dilated convolution operations and their corresponding output features with different dilation rates ($i = 1, 2, 4$ and $8$, respectively). $CAB$ represents the channel attention block. Within the CAB, $x' \in \mathbb{R}^{H \times W \times C}$ represents the features obtained from the input $x \in \mathbb{R}^{H \times W \times C}$ after a ResNet block [14], $H \times W$ denotes the spatial dimension and $C$ is the number of channels. $x'_c \in \mathbb{R}^{H \times W}$ represents the features of the $c$th channel of $x'$. $z \in \mathbb{R}^{1 \times 1 \times C}$ represents a one-dimensional vector composed of the feature weights of each channel. $\delta$ and $\sigma$ are the ReLU and sigmoid activation functions, respectively. $W_1$ and $W_2$ represent the parameters of the first and second fully connected layers, respectively. $\tilde{x} \in \mathbb{R}^{H \times W \times C}$ represents the output of the CAB.
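The channel attention step described above can be sketched as follows. This is a generic squeeze-and-excitation style computation written from the symbol definitions in the text (global average pooling, two fully connected layers with ReLU and sigmoid, channel-wise rescaling); it is an assumption that the CAB takes exactly this form, and the ResNet block and any residual connection are omitted. The function name and the toy weights are hypothetical.

```python
import math


def channel_attention(x, w1, w2):
    """x: list of C channels, each an H x W nested list.
    w1: hidden x C weight matrix, w2: C x hidden weight matrix (no biases)."""
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: first fully connected layer followed by ReLU (delta).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second fully connected layer followed by sigmoid (sigma) gives weights z.
    z = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every channel by its weight.
    scaled = [[[z[c] * v for v in row] for row in x[c]] for c in range(len(x))]
    return scaled, z


# Usage with two 2 x 2 channels and toy weights.
x = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[1.0, 0.0]]        # 1 hidden unit, 2 input channels
w2 = [[0.0], [10.0]]     # 2 output channels, 1 hidden unit
out, z = channel_attention(x, w1, w2)
```

With these toy weights the first channel is halved and the second is passed almost unchanged, which is the sense in which less informative channels are suppressed.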

Texture reconstruction module (TRM)
To enable the model to progressively replenish face texture details, we propose a texture reconstruction module (TRM) and plug it into every pyramid level (different scales in the reconstruction process, see Fig. 2). The structural details of the TRM are shown in Fig. 5, where we constrain the reconstructed image $I_R^i$ at each pyramid level to be as close as possible to the corresponding level of the ground truth image pyramid $I_G^i$ ($i = 1, 2, 3, 4$) in terms of realism and fidelity. The contribution of TRM is twofold. First, we compute attention maps from the reconstructed low-resolution image, which is supervised by the ground truth, and use these maps to reweight the input features $F_{in}$ to generate $F_{SA}$, which contains more beneficial features. Second, we extract the high-frequency texture information from the reconstructed low-resolution sharp face $I_R^i$ with the Laplace operator as facial texture guidance $I_{t.g.}^i$. This facial texture guidance provides sufficient texture information for the subsequent image reconstruction process and helps recover realistic texture details of the face. As illustrated in Fig. 6, the texture guidance contains more and more facial texture information as the scale increases. Finally, we fuse the features ($F_{SA}$, $I_R^i$, $I_{t.g.}^i$) and pass them through the channel attention block (CAB) [16] to suppress the less informative channels at the current pyramid level and only allow the beneficial ones to pass to the next step.
Formally, as shown in Fig. 5, TRM first estimates the residual image $I_{R-B}^i \in \mathbb{R}^{H \times W \times 3}$ by convolving the input features $F_{in} \in \mathbb{R}^{H \times W \times C}$ with a 3 × 3 convolution kernel, where $H \times W$ represents the size of the features and $C$ represents the number of channels. The element-wise summation of the residual image $I_{R-B}^i$ and the input low-resolution blurry image $I_B^i \in \mathbb{R}^{H \times W \times 3}$ reconstructs the low-resolution sharp image $I_R^i \in \mathbb{R}^{H \times W \times 3}$. We provide the ground truth image of the corresponding size to constrain the reconstructed $I_R^i$, which improves the fidelity of the reconstructed results. We further apply a convolution and a sigmoid activation to $I_R^i$ to obtain the corresponding per-pixel attention maps in $\mathbb{R}^{H \times W \times C}$. These maps help TRM re-calibrate the transformed features $F_{in}$ (after a 3 × 3 convolution) and obtain the attention-augmented features $F_{SA} \in \mathbb{R}^{H \times W \times C}$. Subsequently, the high-frequency face texture guidance $I_{t.g.}^i \in \mathbb{R}^{H \times W \times 3}$ is extracted from $I_R^i$ by the Laplace operator (see Eq. 12). It provides adequate texture information for the subsequent step to gradually restore the higher-resolution sharp image $I_R^{i+1}$:

$$I_{t.g.}^i(x, y) = I_R^i(x+1, y) + I_R^i(x-1, y) + I_R^i(x, y+1) + I_R^i(x, y-1) - 4 I_R^i(x, y) \quad (12)$$

where $I_R^i(x, y)$ denotes the pixel value at location $(x, y)$ in the image $I_R^i$. We set the padding parameter in the Conv2d function [42] to 1 to ensure that $x + 1$, $x - 1$, $y + 1$, and $y - 1$ do not cross the boundary.
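A minimal sketch of the Laplace-operator texture extraction is given below for a single-channel image. It assumes the common 4-neighbour discrete Laplacian (neighbours minus four times the centre) and treats pixels outside the image as zero, which mirrors a padding of 1; the sign convention and the function name `laplacian_texture` are assumptions made for this example.

```python
def laplacian_texture(image):
    """4-neighbour discrete Laplacian of a 2-D nested list with zero padding."""
    h, w = len(image), len(image[0])

    def px(y, x):
        # Zero padding: coordinates outside the image contribute nothing.
        return image[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [[px(y, x + 1) + px(y, x - 1) + px(y + 1, x) + px(y - 1, x)
             - 4.0 * image[y][x]
             for x in range(w)]
            for y in range(h)]


# Usage: a flat region gives zero response, an isolated pixel a strong one.
flat = [[1.0] * 3 for _ in range(3)]
impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat_response = laplacian_texture(flat)
impulse_response = laplacian_texture(impulse)
```

The operator responds only where intensity changes, so smooth skin regions contribute little while edges and fine texture dominate the guidance map. For a three-channel image the same function would be applied per channel.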
Finally, we concatenate the features obtained earlier, suppress channels with less information at the current scale with the channel attention block (CAB), and pass the beneficial channel information to the next step as the output $F_{out} \in \mathbb{R}^{H \times W \times (C+6)}$. Formally:

$$F_{out} = CAB\left(\mathrm{Concat}\left(F_{SA}, I_R^i, I_{t.g.}^i\right)\right)$$

Dual discriminators
Inspired by the work of Zhang et al. [64], we design dual discriminators based on relativistic GAN [26]. They consist of a global discriminator and a local discriminator for the facial region. The global discriminator constrains the overall spatial consistency, while the local discriminator models the fine-grained facial feature distribution to restore photorealistic and harmonious faces. Specifically, both the recovered sharp image $I_R$ and the ground truth image $I_G$ are passed into the global discriminator to judge the realism of $I_R$ (see Eq. 16). Then we feed the face regions extracted from $I_R$ and $I_G$ to the local discriminator to judge the authenticity of the recovered face regions (see Eq. 17). At the beginning of training, the global discriminator ensures that $I_R$ and $I_G$ are consistent in overall structure. Once $I_R$ is spatially close to $I_G$, the local discriminator helps recover the edges and detailed textures of the facial region of $I_R$.

Loss function
The training objective of our method is achieved by minimizing a total loss that consists of: (1) a reconstruction loss that constrains the restoration results to match the pyramid of the ground truth image; (2) an adversarial loss for restoring image details and realistic facial textures; (3) an edge loss that further enhances the quality and visual realism of facial details; and (4) an identity preservation loss to protect the original identity information of the input image.

Reconstruction loss
To obtain recovery results at different scales and strengthen the deblurring ability, both a pyramid reconstruction loss and a perceptual loss are used. We found that using the robust Charbonnier loss [3] better handles outliers and improves performance. In the reconstruction loss, $I_R$ and $I_G$ represent the original-size recovery result and the corresponding ground truth, respectively. $I_R^i$ and $I_G^i$ represent the low-resolution outputs and their corresponding ground truth (at $1/2^{5-i}$ times the original size). ∅(·) denotes the VGG19 network [46] pretrained on ImageNet [9], and we use the first five feature maps of the max-pooling layers (after activation) [31]. $\lambda_{pyramid}$ and $\lambda_{per}$ represent the loss weights of the pyramid reconstruction loss and perceptual loss, respectively. We empirically set $\varepsilon = 10^{-3}$ in all experiments [61].
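As a rough illustration of two quantities in this subsection, the sketch below computes the pyramid resolutions implied by the $1/2^{5-i}$ scale factor and a per-pixel Charbonnier penalty with $\varepsilon = 10^{-3}$. The per-pixel averaging is a common implementation choice and an assumption here, as are the function names; the perceptual term and the loss weights are not modelled.

```python
import math


def pyramid_sizes(full_size, levels=4):
    """Side length of the level-i output, i = 1..levels, at 1/2**(5 - i) scale."""
    return [full_size // (2 ** (5 - i)) for i in range(1, levels + 1)]


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean over pixels of sqrt((pred - target)**2 + eps**2)."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += math.sqrt((p - t) ** 2 + eps ** 2)
            count += 1
    return total / count


# Usage: pyramid levels for a 256 x 256 image and two toy loss values.
sizes = pyramid_sizes(256)
same = charbonnier_loss([[0.5, 0.5]], [[0.5, 0.5]])
far = charbonnier_loss([[3.0]], [[0.0]])
```

For large errors the penalty grows like an absolute difference, which is why it is less sensitive to outliers than a squared error, while the small $\varepsilon$ keeps it differentiable at zero.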

Adversarial loss
We employ the relativistic adversarial loss [50] to recover sharper contours and detailed textures. Meanwhile, for facial regions, we introduce a local relativistic discriminator to enhance the model's perception of the facial area. The adversarial loss for the generator is defined in terms of $L^{G}_{global\_adv}$ and $L^{G}_{local\_adv}$, which represent the global adversarial loss and the local adversarial loss for the generator, respectively. In their definitions, $D_{Ra}(\cdot)$ is the relativistic average discriminator [50], $\sigma(\cdot)$ is the sigmoid function and $D(\cdot)$ represents the non-transformed output of the discriminator. $\mathbb{E}_{I_G}[\cdot]$ represents the operation of averaging over all ground truth images in the mini-batch. The face area extractor is the Dlib C++ library in our implementation. The adversarial loss for the discriminator has a symmetrical form.
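The relativistic average formulation can be sketched on raw discriminator scores as follows. This uses the standard relativistic average GAN losses, which is an assumption about the exact form used here; the function names, the `eps` clamp for numerical safety, and the toy scores are hypothetical, and the global and local branches would each apply the same computation to their own scores.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def d_ra(scores_a, scores_b):
    """sigma(D(a) - mean over the batch of D(b)) for every sample a."""
    mean_b = sum(scores_b) / len(scores_b)
    return [sigmoid(s - mean_b) for s in scores_a]


def _mean_log(values, eps=1e-12):
    # Average log-probability, clamped away from log(0).
    return sum(math.log(max(v, eps)) for v in values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Real samples should look more realistic than the average fake sample."""
    real_vs_fake = d_ra(real_scores, fake_scores)
    fake_vs_real = d_ra(fake_scores, real_scores)
    return -_mean_log(real_vs_fake) - _mean_log([1.0 - v for v in fake_vs_real])


def generator_loss(real_scores, fake_scores):
    """Symmetric form: the roles of real and fake samples are swapped."""
    real_vs_fake = d_ra(real_scores, fake_scores)
    fake_vs_real = d_ra(fake_scores, real_scores)
    return -_mean_log([1.0 - v for v in real_vs_fake]) - _mean_log(fake_vs_real)


# Usage: an undecided discriminator, then one that separates the batches well.
tie_d = discriminator_loss([0.0, 0.0], [0.0, 0.0])
good_d = discriminator_loss([2.0, 2.0], [-2.0, -2.0])
good_g = generator_loss([2.0, 2.0], [-2.0, -2.0])
```

When the discriminator separates recovered and ground truth images well, its own loss is small and the generator loss is large, which is the pressure that pushes the generator toward more realistic textures.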

Edge loss
Recent studies have found that adding auxiliary loss functions to the reconstruction loss yields better deblurring performance [23]. Therefore, we use an edge loss to constrain the differences in frequency space so that the final output recovers more realistic high-frequency details. In this loss, the edge map is extracted from each image by the Laplacian operator [22].

Identity preservation loss
To improve the fidelity and authenticity of identity characteristics while deblurring, we follow [18,33] and apply an identity preservation loss. We first extract facial features from the recovered faces and the corresponding ground truth using the pre-trained ArcFace model [10], and calculate the cosine distance between these features as the identity preservation loss $L_{id}$. Formally:

$$L_{id} = 1 - \cos\left(\phi(I_R), \phi(I_G)\right)$$

where $\cos(\cdot, \cdot)$ is the cosine similarity of two vectors, and $\phi(\cdot)$ represents the face feature extractor, i.e., ArcFace [10].

Algorithm 1 MPFD-GAN Algorithm.
Restore the multi-scale results with the generator.
Compute the losses (see Eqs. 14, 20 and 21).
Discriminate whether the distribution of $I_R^i$ is real or fake (see Eqs. 16 and 18).
Discriminate whether the distribution of the face in $I_R^i$ is real or fake.
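The identity preservation term used when computing the losses can be sketched as a cosine distance between two feature vectors. In this example plain Python lists stand in for the ArcFace embeddings, which are not computed here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_loss(feat_recovered, feat_ground_truth):
    """Cosine distance: 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(feat_recovered, feat_ground_truth)


# Usage with stand-in embeddings.
same_identity = identity_loss([0.6, 0.8], [0.6, 0.8])
unrelated = identity_loss([1.0, 0.0], [0.0, 1.0])
```

Because only the direction of the embedding matters, the term is insensitive to the overall magnitude of the features and penalizes only changes in the identity representation.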

Datasets
We conduct experiments on three publicly available face datasets to demonstrate the effectiveness of our method: (1) a high-resolution face dataset (image size: 256 × 256), which consists of 29,996 blurry face images generated using the CelebA-HQ dataset [37]; (2) a middle-resolution face dataset (image size: 192 × 192), which consists of 23,708 blurry face images generated using the UTKFace dataset [65]; (3) a low-resolution face dataset (image size: 160 × 160), which consists of 196,973 blurry face images generated using the CelebA dataset [38]. As shown in Table 1, we split each dataset into training, validation, and testing sets following previous work [57,58,61].
To simulate a real blurring scene, we use a blur kernel of size 25 × 25 with a motion angle of 45 degrees to blur the sharp images (CelebA-HQ) and then add random white Gaussian noise [19]. In addition, we generate 22,127 clean-blurry image pairs based on the UTKFace dataset following the approach of Yasarla et al. [58]. To further demonstrate the performance of our network in realistic scenarios, we apply a more complex blur to the CelebA dataset following the ideas of Boracchi et al. [1]. Specifically, we use a Markov process to generate random motion trajectory vectors [24]. Next, we perform sub-pixel interpolation on the trajectory vectors to obtain motion blur kernels with sizes from 13 × 13 to 29 × 29 [31]. Finally, we randomly apply one to three blur kernels to the original images and add random white Gaussian noise to obtain a blurred face dataset that closely approximates real-world blur. This approach can simulate the sudden movements that occur when people press the camera button or try to compensate for camera shake [1].
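For the simple CelebA-HQ setting, a linear motion blur kernel at 45 degrees can be sketched as a normalized line along the anti-diagonal of a 25 × 25 grid. The text does not specify how the kernel is constructed, so this construction and the function name `linear_motion_kernel_45` are assumptions made for illustration; the Markov-process trajectories used for CelebA are not modelled.

```python
def linear_motion_kernel_45(size=25):
    """size x size kernel with equal weights along the anti-diagonal."""
    kernel = [[0.0] * size for _ in range(size)]
    for i in range(size):
        # Row index decreases as the column index increases: a 45-degree line.
        kernel[size - 1 - i][i] = 1.0 / size
    return kernel


# Usage: the weights sum to one, so blurring preserves average brightness.
kernel = linear_motion_kernel_45(25)
kernel_sum = sum(sum(row) for row in kernel)
```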

Implementation
MPFD-GAN is an end-to-end learned method for blind face deblurring. It does not require any pre-trained model to generate facial priors. We train separate models for two different datasets with the following settings. The training batch size is set to 8 and 14 on the CelebA-HQ and CelebA datasets, respectively. We augment the training data with horizontal flips and use AdamW [27] as the optimizer for a total of 500 epochs. Furthermore, the initial learning rate is set to $1 \times 10^{-3}$ and gradually decreased to $1 \times 10^{-4}$ using the Multi-Step LR strategy [49]. All experiments are implemented with the PyTorch framework [42] on a GeForce RTX 2080Ti GPU.
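A multi-step learning rate schedule of the kind described above can be sketched as a pure function of the epoch. The milestone epoch (250) and decay factor (0.1) are hypothetical values chosen so that the rate moves from $1 \times 10^{-3}$ to $1 \times 10^{-4}$; the actual milestones are not given in the text.

```python
def multistep_lr(epoch, base_lr=1e-3, milestones=(250,), gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** drops)


# Usage over a 500-epoch run with the assumed single milestone.
lr_start = multistep_lr(0)
lr_before = multistep_lr(249)
lr_after = multistep_lr(250)
lr_end = multistep_lr(499)
```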

Comparison with state-of-the-art work
To verify the effectiveness of our method on blind face deblurring tasks, we compare MPFD-GAN with several state-of-the-art methods: DeblurGAN [31], DeblurGAN-v2 [32], DMPHN [8], SIUN [59], UMSN [57], MPRNet [61] and MIMO-UNet [7]. For a fair comparison, we used their officially published code and followed their experimental setups. We performed quantitative and qualitative comparisons of the test results of all methods on the CelebA, UTKFace and CelebA-HQ datasets, respectively.

Quantitative comparison
Like similar tasks [57,61], we adopt the non-reference perceptual metrics FID [15] and NIQE [40] to measure realness. For facial fidelity, we adopt a perceptual metric (LPIPS [63]) and pixel metrics (PSNR and SSIM [51]). LPIPS compares the difference between image patches, PSNR measures the distance between pixels, and SSIM assesses similarity in structure, contrast and luminance. To verify the ability of different methods to recover identity features, we also calculate the face similarity between the recovered results and the corresponding ground truth using cosine similarity as in Deng et al. [10]. Furthermore, we introduce the mean normalized error (MNE) [48] into face deblurring tasks for the first time to evaluate the recovery performance of facial contours (according to the offset distance of face key points). Formally:

$$MNE = \frac{1}{N} \sum_{i=1}^{N} \frac{\left\| I_R(i) - I_G(i) \right\|_2}{d_{io}}$$

where $I_R(i)$ is the coordinate of the $i$th face key point and $I_G(i)$ is its corresponding ground truth. $d_{io}$ and $N$ denote the distance between the eyes and the number of key points of the face, respectively. $N$ is set to 68 in this article. A smaller value means that the face contour is closer to the ground truth. Tables 2, 3 and 4 show the quantitative results of our method and the state-of-the-art deblurring methods on the CelebA, UTKFace and CelebA-HQ datasets, respectively. We can see that our method obtains the lowest LPIPS on the CelebA and CelebA-HQ datasets, which indicates that our output is perceptually closest to the ground truth. Our method achieves the lowest FID and NIQE, which indicates that the output of our method has a small distance from the realistic face distribution and the natural image distribution, respectively. Our method also obtains the highest PSNR and SSIM, which indicates that our output results are closest to the ground truth at the structure and pixel level. Furthermore, the faces in our output results have the highest similarity to the ground truth (shown in Tables 2, 3 and 4), demonstrating that our method recovers better identity information. Our MPFD-GAN also obtains a lower MNE (slightly higher than the SIUN method on CelebA-HQ), which indicates that our approach recovers a facial contour closer to the ground truth.

Table notes: Bold and underline indicate the best and the second best performance, respectively. FS represents the face similarity. "↑" denotes higher is better, and "↓" denotes lower is better. In the loss ablation table, $L_{pyramid}$, $L_{edge}$, $L_{id}$ denote pyramid restoration loss, edge loss, and identity preservation loss, respectively.

Figure caption (fragment): ... on the CelebA-HQ (first three lines) and UTKFace (last three lines) datasets for blind face deblurring. Our MPFD-GAN produces faithful texture details in eyes, veils and teeth.

Figures 7 and 8 show the qualitative results of the various methods on the CelebA-HQ, UTKFace and CelebA datasets. We can observe from Fig. 7 that our MPFD-GAN recovers realistic details such as eyes (eyelashes, etc.), facial decorations and teeth. This is because TRM fuses detailed high-frequency texture information and filters the feature information along the spatial and channel dimensions, letting the useful features pass to the next scale. As shown in Fig. 8, our MPFD-GAN recovers the complete structure and more realistic texture details, achieving the best visual quality on the CelebA dataset. Among the compared methods, the recovery results obtained by DeblurGAN have significant checkerboard artifacts (see Fig. 8b). The deblurred outputs obtained by DeblurGAN-v2 and SIUN also show significant ghosting and poor visual quality (see Fig. 8c, e). The face images recovered by DMPHN and UMSN suffer from facial disharmony (see Fig. 8d, f). In addition, we found that MPRNet and MIMO-UNet are not effective at deblurring heavily blurred face images (see Fig. 8, rows 3 and 5). They also fail to remove blur in local details (e.g., eyes, teeth, etc.) and structures in enlarged image regions (see Fig. 7g, h).
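The MNE metric described in the quantitative comparison above can be sketched as the average key-point offset normalized by the inter-ocular distance. In this example the landmark coordinates are toy values rather than detector outputs, and the function name is hypothetical.

```python
import math


def mean_normalized_error(pred_points, gt_points, inter_ocular_distance):
    """Average Euclidean key-point offset divided by the distance between the eyes."""
    n = len(gt_points)
    total = sum(math.hypot(p[0] - g[0], p[1] - g[1])
                for p, g in zip(pred_points, gt_points))
    return total / (n * inter_ocular_distance)


# Usage with two toy key points (the paper uses N = 68).
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
mne = mean_normalized_error(pred, gt, inter_ocular_distance=5.0)
```

Normalizing by the inter-ocular distance makes the score comparable across faces of different sizes and across the three image resolutions used in the experiments.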

Ablation studies
To better understand the roles of the different components of MPFD-GAN and the training strategy, we conduct an ablation study in this subsection by introducing variants of the proposed method and comparing their blind face deblurring performance. All ablation experiments are performed on the CelebA dataset.

Ablation on network architecture
As shown in Table 5 (configuration a) and Fig. 9a, the recovered face images lose facial details and obtain poor evaluation metrics when the SE-ResNet block in MPFD-GAN is replaced with the ResNet block. The results indicate that the SE-ResNet block is essential for MPFD-GAN to extract effective features from blurry images. When FRM and TRM are removed separately (configurations b and c), we can observe that (1) the face images in the recovered results lose many details in the eye region, causing the eyes to remain blurry (see Fig. 9b, c); (2) both the perceptual metrics and the pixel-wise metrics in Table 5 (configurations b and c) degrade. These comparison results demonstrate that FRM and TRM contribute significantly to the deblurring process of MPFD-GAN. Finally, we compare the experimental results with and without the facial region discriminator (see Table 5; Fig. 9d). The results show that the local discriminator helps the model better restore the distribution of facial regions and effectively recover realistic details.

Ablation on loss functions
We further investigate the contribution of the different loss terms by adjusting the weight of each loss in Eq. (22); the results are shown in Table 6. When we remove the pyramid restoration loss, every metric degrades. This result indicates that the pyramid restoration loss enhances the ability of MPFD-GAN to recover sharp images from blurry ones. This intermediate supervision, which helps recover sharp multi-scale face images at each upsampling scale, also enhances the overall deblurring performance of the model. When the supervision on realistic texture details (edge loss) or on face features (identity preservation loss) is removed, the final recovered images no longer achieve the best performance (see Table 6). All the above ablation experiments again demonstrate that the design of MPFD-GAN and our proposed individual modules are effective for the blind face deblurring task.

Conclusion
In this paper, we propose a multi-scale progressive deblurring network named MPFD-GAN to deblur face images with unknown degradation without requiring extra inputs (i.e., face priors and facial component dictionaries). This approach includes two core modules: FRM and TRM. The former explores multi-scale receptive field information to help MPFD-GAN recover the complete image structure. The latter prompts MPFD-GAN to progressively reconstruct facial texture details by fusing high-frequency texture information. Comparative experiments on the CelebA, UTKFace, and CelebA-HQ datasets demonstrate the superiority and robustness of our MPFD-GAN. Ablation experiments further verify the effect of the core modules (namely FRM and TRM) of MPFD-GAN for the above tasks. In conclusion, MPFD-GAN provides a robust and easy-to-use solution for the face deblurring task.

Future work
In future work, we will consider employing MPFD-GAN to solve the challenging task of blind face restoration. The task demands recovering high-quality faces from low-quality counterparts suffering from more complex unknown degradation [5], such as low resolution, color fading, and lossy-compression artifacts. To this end, we performed some simple experiments to test the potential of our MPFD-GAN for this task. Following the experimental data setting of Li et al. [34], we retrain the proposed model and UMSN [57] on the FFHQ dataset [28] and test both methods on three real-world datasets: CelebChild-Test [49], LFW-Test [28] and WebPhoto-Test [49]. The experimental results are shown in Fig. 10, and we can see that MPFD-GAN achieves better visual performance than UMSN [57] on real-world datasets. The above experiments are only a first attempt at applying MPFD-GAN to blind face restoration. We will conduct more experiments in the future to address the task well.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.