A survey on facial image deblurring
B. Wang, F. Xu, Q. Zheng
arXiv:2302.05017v2 [cs.CV], 16 Mar 2023

Blur in a facial image significantly affects high-level vision tasks such as face recognition. The purpose of facial image deblurring is to recover a clear image from a blurry input, which can in turn improve performance on downstream tasks such as recognition. However, general deblurring methods do not perform well on facial images. Therefore, face-specific deblurring methods have been proposed that improve performance by adding semantic or structural information as priors tailored to the characteristics of facial images. In this paper, we survey and summarize recently published methods for facial image deblurring, most of which are based on deep learning. First, we provide a brief introduction to the modeling of image blurring. Next, we divide face deblurring methods into two categories: model-based methods and deep learning-based methods. Furthermore, we summarize the datasets, loss functions, and performance evaluation metrics commonly used in training neural networks. We report the performance of classical methods on these datasets and metrics and briefly discuss the differences between model-based and learning-based methods. Finally, we discuss the current challenges and possible future research directions.


Introduction
Facial image deblurring is a technique used to recover clear facial images with sharp textural details from blurry facial images. Blurry images are widespread in everyday life and can be caused by various factors, such as optical aberrations, camera shake, and object movement. The two most common types of blur are motion blur and defocus blur. An example of blurry facial images caused by these two types is shown in Fig. 1.
With the development of technologies such as face recognition, the processing of degraded facial images has become an important research topic. In surveillance footage and other unconstrained settings, faces are prone to motion blur, which can greatly reduce the performance of systems such as face recognition and video surveillance. Consequently, facial image deblurring has become a significant research topic in computer vision and has attracted an increasing number of researchers.

Problem definition
Blur can be caused by various factors, and we represent it by a unified blur operator K. Because noise may also be present in the image degradation process, we use the following model to represent the degradation:
$$Y = K(X) + n, \tag{1}$$
where Y refers to the degraded image, X refers to the clear image, K refers to the blur operator, and n refers to the noise, which we consider to be additive. In addition, many methods simplify the blur degradation process to a linear function and assume that the blur in an image is spatially invariant; therefore, Eq. (1) can be rewritten as the following simplified model:
$$Y = k * X + n, \tag{2}$$
where k refers to the blur kernel and * refers to the convolution operation. Image deblurring aims to obtain the clear image X from the blurry image Y. Depending on the cause of the blur, the blur kernel k can be modeled in different forms, as described below.
Defocus blur [1]. When the object is not in the focal plane of the camera or the scene has a shallow depth of field, out-of-focus areas lose detail and texture. The point spread function (PSF) of out-of-focus blur can be represented by the following model:
$$k(i, j) = \begin{cases} \dfrac{1}{\pi r^2}, & i^2 + j^2 \le r^2, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
where r is the radius of the blur and (i, j) refers to the pixel coordinates, measured from the kernel center.
Motion blur. This is caused by the motion of the camera or the movement of the object. When the camera moves in a fixed direction during the exposure, the blur kernel can be modeled as
$$k(i, j) = \begin{cases} \dfrac{1}{L}, & \sqrt{i^2 + j^2} \le \dfrac{L}{2} \ \text{and} \ \dfrac{i}{j} = -\tan\theta, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where L is the motion length and θ is the motion angle. However, actual motion blur is more complicated than this model. On the one hand, the direction and magnitude of the camera movement can vary; on the other hand, in a dynamic scene, only some objects move while others are stationary. Several methods have been proposed to generate simulated blur kernels [2,3]. However, ensuring the diversity of the simulated kernels and the realism of their distribution remains an open problem.
Gaussian blur. Many methods use a simple Gaussian function to represent the blurring process [4,5]. The Gaussian kernel function can be represented by the following model:
$$k(i, j) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{i^2 + j^2}{2\sigma^2}\right), \tag{5}$$
where σ is the standard deviation that indicates the degree of blur.
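The three kernel families and the simplified degradation model Y = k * X + n can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not code from any surveyed method: the function names are hypothetical, the kernels are discretized on a pixel grid and normalized to sum to 1 (rather than using the continuous normalization constants), the linear motion kernel is rasterized by sampling points along the segment, and the convolution is circular (computed with the FFT).

```python
import numpy as np

def disk_kernel(r):
    """Defocus PSF: uniform disk of radius r, normalized to sum to 1."""
    half = int(np.ceil(r))
    j, i = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    k = (i**2 + j**2 <= r**2).astype(float)
    return k / k.sum()

def gaussian_kernel(sigma):
    """Gaussian PSF with standard deviation sigma, normalized to sum to 1."""
    half = int(np.ceil(3 * sigma))               # cover about +/- 3 sigma
    j, i = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    k = np.exp(-(i**2 + j**2) / (2.0 * sigma**2))
    return k / k.sum()

def linear_motion_kernel(length, theta_deg):
    """Linear motion PSF: a rasterized segment with the given length and angle."""
    half = int(np.ceil(length / 2))
    size = 2 * half + 1
    k = np.zeros((size, size))
    theta = np.deg2rad(theta_deg)
    # Sample densely along the segment and mark the nearest pixel each time.
    for t in np.linspace(-length / 2, length / 2, 4 * size):
        row = int(round(half - t * np.sin(theta)))
        col = int(round(half + t * np.cos(theta)))
        k[row, col] = 1.0
    return k / k.sum()

def degrade(x, k, noise_std=0.0, seed=0):
    """Y = k * X + n with circular convolution and additive Gaussian noise."""
    kh, kw = k.shape
    pad = np.zeros(x.shape, dtype=float)
    pad[:kh, :kw] = k
    # Move the kernel center to the origin so the image is not shifted.
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    y = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(pad)))
    rng = np.random.default_rng(seed)
    return y + noise_std * rng.normal(size=x.shape)

# Toy usage: blur a white square with a Gaussian kernel and add mild noise.
x = np.zeros((64, 64))
x[24:40, 24:40] = 1.0
y = degrade(x, gaussian_kernel(2.0), noise_std=0.01)
```

Normalizing each kernel to unit sum keeps the mean brightness of the image unchanged, which is the usual convention when synthesizing blurry-sharp pairs.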

Scope of this survey
Facial image deblurring is a domain-specific image deblurring problem. Its solutions have been developed by adapting and improving general deblurring methods. Some reviews [6,7] summarize the existing general deblurring models. Li [6] provided a brief summary of traditional and deep representation-based image deblurring methods, whereas Zhang et al. [7] focused on a detailed introduction of deep learning-based image deblurring methods.
In contrast to these studies, this survey focuses on summarizing the deblurring research conducted on facial images. Facial images are a special application scenario: they contain fewer textures and edges than general scene images; therefore, general deblurring methods cannot produce good results on them. In addition, although identities differ, all faces are composed of the same fixed components, which can be used as prior information to improve the performance of general methods. On this basis, a number of studies dedicated to the deblurring of facial images have been conducted, which we summarize in this article.
Before 2015, the methods used for face deblurring were mainly model based. Some were proposed to improve recognition performance, whereas others adapted general image deblurring methods in the spatial domain. Here, we mainly summarize the research conducted in the field of face deblurring after 2010. Since 2016, owing to the strong fitting ability of convolutional neural networks, methods based on deep learning have been gradually proposed and have achieved better performance. Therefore, we mainly introduce the deep learning-based methods in this survey. These methods differ mainly in how they introduce sufficient priors into the model to compensate for the scarcity of texture in facial images. The taxonomy of the methods involved is shown in Fig. 2.
The rest of the paper is structured as follows. We provide a brief introduction to the model-based methods in Section 2. Then, we summarize the learning-based methods in Section 3. Section 4 introduces the blurry-sharp facial datasets commonly used by learning-based methods. Section 5 lists the loss functions and neural network training strategies commonly used in various models. Section 6 lists the metrics used to evaluate the image quality. In Section 7, we compare the performance of existing typical methods (including the model-based and deep learning-based methods). Section 8 provides a brief discussion and a macro comparison of the main differences between the model-based and learning-based methods. Finally, in Section 9, we summarize the limitations of the current study and the future research directions.

Model-based methods
The earliest methods transform a blurry image into the frequency domain to solve this problem. Nishiyama et al. [8] addressed the blur degradation problem by learning the feature subspace of facial images in the frequency domain. Blurry images with the same PSF are projected into the same subspace, indicating that they have the same degree of blurring. At inference time, the kernel of the blurry image is taken to be the PSF of the nearest subspace. However, these methods can only deal with a fixed set of blur degradations and cannot be generalized to real-world blurry images with complicated and nonuniform degradation.
Most model-based methods focus on solving the problem in the spatial domain. According to the degradation model in Eq. (2), obtaining clear facial images can be expressed as
$$(\hat{x}, \hat{k}) = \arg\min_{x, k} \; \|k * x - y\|_2^2 + f(x) + g(k), \tag{6}$$
where the first term on the right-hand side of the equation is the fidelity term; x and y denote the latent sharp image and the blurry image, respectively; * denotes the convolution operation; and f(x) and g(k) are the regularization terms for the latent sharp image and the blur kernel, respectively. Because it is an ill-posed problem to simultaneously estimate the blur kernel and the sharp image from a blurry image, adding regularization terms is necessary to narrow the solution space. Eq. (6) is usually solved iteratively, that is, by alternating between estimating the blur kernel and recovering the latent image. The main idea of traditional image deblurring methods is to restore salient edges, implicitly or explicitly, and use them to estimate the blur kernel k.
However, in facial images, only a few edges are available for blur kernel estimation. Therefore, existing general deblurring methods cannot achieve satisfactory results for blurry facial images.
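To make the role of the regularization term concrete, the sketch below solves only the image step of the alternating scheme, with the kernel held fixed and a simple quadratic regularizer f(x) = λ‖x‖² chosen for illustration. Under circular convolution this subproblem has a closed-form solution in the Fourier domain. The quadratic prior, the parameter `lam`, and the function names are assumptions made for this example; the surveyed methods use sparser, face-specific priors and also update the kernel.

```python
import numpy as np

def kernel_to_otf(k, shape):
    """Zero-pad kernel k to `shape`, center it at the origin, and take its FFT."""
    kh, kw = k.shape
    pad = np.zeros(shape)
    pad[:kh, :kw] = k
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(pad)

def deconv_l2(y, k, lam=1e-3):
    """Minimize ||k * x - y||^2 + lam * ||x||^2 over x (circular convolution)."""
    K = kernel_to_otf(k, y.shape)
    X = np.conj(K) * np.fft.fft2(y) / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

# Toy check: blur a square with a 5x5 box kernel, then restore it.
x = np.zeros((32, 32))
x[10:22, 10:22] = 1.0
k = np.ones((5, 5)) / 25.0
y = np.real(np.fft.ifft2(np.fft.fft2(x) * kernel_to_otf(k, x.shape)))
x_hat = deconv_l2(y, k, lam=1e-6)
```

Without the `lam` term, frequencies at which the kernel response is close to zero would be divided by a near-zero value and any noise at those frequencies would be amplified, which is one way to see why the problem is ill-posed.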
Zhang et al. [9] adopted a sparse prior for the regularization of the latent image and an L2-norm prior for the regularization of the blur kernel. Moreover, a sparse representation prior was added to the face recognition process, and clear facial images were then reconstructed by jointly optimizing face restoration and recognition. By combining these two tasks, the authors were able to demonstrate a significant improvement over treating them individually. However, this method is only effective for facial images with good face alignment and simple motion blur. Given a specific patch of a facial image, there are many similar non-local patches near it. Using this property, Tian et al. [10] introduced the weighted non-local self-similarity method [11], originally proposed for denoising, into a sparse representation model and verified its effectiveness in face deblurring. Subsequently, Anwar et al. [12] proposed class-specific priors by transforming the images of a specific class into Fourier space. Specifically, they learned a subspace spanned by the responses of the sharp images in each class to a bandpass filter. In this manner, the method achieves improved results when dealing with blurry images lacking high-frequency details.
Zhang et al. [13] found that, in the iterative optimization process, some pixels of the intermediate restored image did not satisfy the model of Eq. (2), which was unfavorable for the next kernel estimation step. They proposed a pixel screening method to correct intermediate images and screen out bad pixels to facilitate more accurate kernel estimation. Tian and Tao [14] updated Eq. (2) by redefining the blur kernel and latent sharp images. They represented the PSF as a linear combination of a set of predefined orthogonal PSFs. Similarly, the estimated intrinsic (EI) sharp facial image was represented as a linear combination of a set of predefined orthogonal facial images. The coefficients of the PSF and EI were jointly learned by minimizing the reconstruction error. Finally, they used a blind image quality assessment method [15] to automatically select the best images.
In addition, some methods using reference images have been developed. Hacohen et al. [17] proposed a facial image deblurring method that uses a reference image of the same scene as the blurry input. The reference image serves two purposes. On the one hand, information from it can facilitate kernel estimation. On the other hand, it can serve as a strong local prior for nonblind deconvolution. The algorithm performs well in deblurring certain classes of images. However, requiring a reference image of the same scene as the input limits its practical applicability. Subsequently, Pan et al. [16] proposed a method that does not require the exemplar and test images to have the same identity and background. The authors constructed an exemplar dataset of facial images from the CMU PIE dataset and extracted the important structures of each image, including facial contours, eyes, and mouths. During testing, the structure of the test image was compared with those of the exemplars and the best match was found. Fig. 3 shows an example of the matched facial structures given in the paper. This structure is used to recover the image edges and guide the blur kernel estimation process. After the face edges are obtained, the blur kernel is estimated and the latent image is restored alternately, again according to Eq. (2). The face structure information in the exemplar library helps extract face edges and suppresses the ringing and other artifacts produced by traditional edge selection methods. However, matching each image against an exemplar dataset is computationally intensive: processing a single image takes several hours. Moreover, this type of approach tends to be inaccurate for face poses and angles that are not present in the dataset.

Methods based on deep learning
According to Zhang et al. [7], neural networks for image deblurring share some commonly used basic blocks, such as the convolution block, ResBlock, Inception block [18], and DenseBlock. In addition, there are some commonly used network architectures such as U-Net [19], multi-scale networks [20], generative adversarial networks (GANs) [21], and cascade networks [22]. U-Net consists of an encoder, a decoder, and skip connections and can perform image transformation in an end-to-end manner. The multi-scale network feeds the original-size image and downsampled low-resolution images into the network. It first performs image restoration at a small scale and then restores the images up to their original size. It thus performs image deblurring in a coarse-to-fine manner, which greatly increases the computational load of the model. Similarly, cascaded networks can generate higher quality images owing to the concatenation of the networks. GANs can generate diverse and realistic facial images; however, the generated sharp facial image is sometimes not faithful to the corresponding blurry input.
To reduce the ill-posedness of the problem, Jin et al. [23] improved the basic block. They advocated enlarging the receptive field of ResNet [24] and employed a resampling convolution operation that ensures a wide receptive field (RF) from the first layer while remaining computationally efficient. The advantages of a large receptive field are twofold. First, the last few layers of the network can take advantage of higher levels of abstraction to better understand the image content and ultimately produce better deblurred outputs. Second, a larger RF contains more structural information, so more erroneous latent image-kernel pairs can be excluded than with a small RF that sees less structure, thereby reducing the ill-posedness of the problem. To avoid artifacts, the authors concatenated a hybrid subnetwork and a deblurring network by applying several convolutional layers to process local images.
Fig. 4 Example of the first class of methods [25]. The blurry input first goes through a neural network to generate an estimated sketch. The traditional optimization processes of kernel estimation and deconvolution are then performed on the basis of the gradients of the sketch and blurry input; finally, a clear output is obtained through the de-artifacts network.
Many models combine the above network architectures and basic blocks to achieve good restoration of blurry faces. Next, we summarize the deep learning methods for facial image deblurring according to whether paired training data are required. Additionally, we provide a summary of these methods in Table 1.

Supervised learning
Some deep learning methods have been developed for image deblurring and applied to facial images. These methods can be roughly divided into three categories.
The first class of methods first uses a neural network to estimate the blur parameters and then restores sharp images within a model-based image deblurring framework. Chakrabarti [2] utilized a convolutional neural network (CNN) to compute the complex Fourier coefficients of the deconvolution filters of each image patch, followed by nonblind deconvolution. Lin et al. [25] utilized U-Net to extract face sketches for blur kernel estimation and nonblind deconvolution. Its architecture is shown in Fig. 4.
The second class of methods uses multiple subnetworks to improve the deblurring results. Schuler et al. [22] used a cascaded network structure to iteratively perform the kernel estimation and deconvolution processes. The algorithm adopted a coarse-to-fine strategy similar to that used in model-based deblurring methods. However, the network does not generalize well to kernels of different sizes. Xu et al. [26] constructed two subnetworks for image deconvolution and artifact removal. Their architecture is shown in Fig. 5. Chrysos et al. [27] adopted a two-step architecture to perform facial image deblurring. The first step uses a high-performance hourglass network to recover the low and medium frequencies of the image. The second step recovers the high-frequency details of the image by training a conditional GAN while ensuring that the output images are close to natural images.
The third class of methods uses an end-to-end learning approach to model the entire restoration process. Nah et al. [20] proposed a multi-scale CNN to directly perform image deblurring without an intermediate kernel estimation step. The architecture of their model is shown in Fig. 6. Chrysos et al. [28] were the first to explicitly use deep architectures for face deblurring. They modified the classic ResNet architecture to perform end-to-end face deblurring. In this architecture, face alignment is used to preprocess each face and weak supervision is applied to exploit the structure of the face. However, during testing, preprocessing of the blurry input is prone to errors, which degrades the subsequent deblurring. Wang et al. [29] stacked multi-scale Inception modules in a residual manner to perform face deblurring. The convolution kernels of different sizes in the Inception module [18] can handle different degrees of blur in the input image; however, they are memory intensive. Qi et al. [30] employed a GAN architecture to explore its specific effects on face deblurring. The authors adopted an improved U-Net and a feature enhancement module as the generator. Through careful design of the basic network blocks, enhanced feature representation, and adversarial training, the proposed method can generate more realistic faces.
Facial images contain few edges and textures. When only blurry-sharp face pairs are provided to the model, the deblurred outputs are not sufficiently clear and exhibit artifacts and poor details. Building on the above three types of methods, many studies have explored the effectiveness of face-specific prior information, such as semantics, for deblurring. Although faces vary widely, they are composed of fixed components. To fully exploit facial features, many methods extract specific information from facial images as a prior to guide network training. The priors proposed by recent studies are facial geometry priors (such as face landmarks [28], face parsing maps [4,31-34], and 3D facial models [35,36]) and deep feature priors [5,36,37].
Fig. 5 Example of the second class of methods [26]. This model performs deconvolution and de-artifacting sequentially through two network modules.
Fig. 6 Example of the third class of methods [20], which is an end-to-end deblurring neural network. This method first deblurs the low-resolution blurry image B3, then upsamples the result and feeds it into the deblurring stage at the next scale, B2. Finally, the original-scale image B1 is restored.
Shen et al. [31] were the first to incorporate the semantic parsing information of facial images into a network as a prior. The architecture of their model is shown in Fig. 7, which is a typical model with a face semantic map as a prior. The authors adopted the deblurring network proposed by Nah et al. [20]. They obtained the semantic map of the blurry facial image through a semantic segmentation network and concatenated the semantic map and the blurry image along the channel dimension as the input of the deblurring network. To enhance the local performance, such as on the eyes, eyebrows, and mouth, Shen et al. also proposed a local structure loss to constrain the local output of the network. However, when there is severe motion blur in the input image, the semantic maps extracted from the blurry image are likely to be incorrect. Consequently, the authors improved their work in 2020 [33]. To extract a correct face parsing map, Shen et al. added a coarse deblurring network before the face parsing network to reduce the blur in the input image. Then, the face parsing network extracts the semantic maps from the coarse deblurred images. Finally, the fine deblurring network recovers sharp facial images from the given blurry input images, coarse deblurred images, and corresponding semantic maps. Another improvement is that, in the local structure loss, a separate weight is set for each face component instead of using a fixed weight for all components. This adaptive local structure loss can adjust the weights and recover fine details on the basis of the size of each facial component.
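The two ideas in this paragraph, channel-wise concatenation of the parsing map and a per-component loss, can be illustrated with a small NumPy sketch. It is not the implementation of [31] or [33]: the function names, the one-hot encoding of the parsing map, the L1 error, and the specific weighting (inverse component area in the adaptive case, a single fixed weight otherwise) are assumptions chosen only to show the mechanism.

```python
import numpy as np

def concat_semantic_prior(blurry, parsing, num_classes):
    """Stack an RGB image (H, W, 3) with a one-hot parsing map (H, W, C)."""
    one_hot = np.eye(num_classes)[parsing]           # (H, W, num_classes)
    return np.concatenate([blurry, one_hot], axis=-1)

def local_structure_loss(pred, target, parsing, component_ids, adaptive=True):
    """Sum of weighted L1 errors restricted to selected face components."""
    total = 0.0
    for c in component_ids:
        mask = parsing == c                          # pixels of component c
        area = int(mask.sum())
        if area == 0:
            continue
        err = np.abs(pred[mask] - target[mask]).sum()
        # Assumed weighting: inverse area, so small components such as the
        # eyes contribute as much as large ones; otherwise one fixed weight.
        weight = 1.0 / area if adaptive else 1.0 / mask.size
        total += weight * err
    return total

# Toy usage: an 8x8 image with one 2x2 "eye" region labeled as class 1.
blurry = np.zeros((8, 8, 3))
parsing = np.zeros((8, 8), dtype=int)
parsing[2:4, 2:4] = 1
net_input = concat_semantic_prior(blurry, parsing, num_classes=2)
loss = local_structure_loss(np.zeros((8, 8, 3)), np.ones((8, 8, 3)),
                            parsing, component_ids=[1])
```

In this toy case the network input has 3 + 2 = 5 channels, and the adaptive loss for the eye region is the summed error over its 4 pixels and 3 channels divided by its area.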
In the same year as the work of Shen et al. [31], Song et al. [4] exploited semantic information to jointly perform face super-resolution and deblurring. The model mainly consists of two modules. The first module is a deep CNN called the facial structure generation network. The network takes the degraded image and semantic masks as inputs and predicts a base image containing the basic structure of the face. To enhance the facial details in the base image, the authors developed a detail enhancement algorithm using high-resolution (HR) example images. In the first step, the patch correspondence between the base image and the HR example image is established. Then, the base image is regressed using the patches from the HR example image and an intermediate result is obtained. In the second step, the details of the intermediate result are transferred to the base image through an edge-preserving filter to obtain the final result. The detail enhancement module ensures that, even if the semantic masks extracted from the blurry input are inaccurate, a realistic face with detailed edges can be recovered.
Fig. 7 Use of a face semantic map as a prior for the deblurring neural network [33]. This method first obtains the semantic labels of blurry facial images through a face parsing network. Then, the semantic map is concatenated with the blurry image in the channel dimension as the input of the deblurring network.
However, some facial components, such as the eyes, eyebrows, and mouth, cannot be reconstructed well because they occupy only a small area of the face. Yasarla et al. [32] noticed this class imbalance in the semantic maps of faces and proposed a confidence measure-based multi-stream semantic network. In the first stage of the network, different streams are used to extract the semantic map features of each class. The extracted features are concatenated and fed, together with the fused semantic map, into the second stage. In the second stage, class-specific residual feature maps are learned using a nested residual learning strategy, and the residual feature maps that are added to the overall blurry image are estimated. The authors introduced a confidence parameter for each class to measure its importance. This parameter is predicted by a separate confidence network and is used in the loss function to reweight the contribution of each class to the total loss. In this way, the trained network can offset the imbalance among classes, enabling it to better recover the details of small regions such as the nose and eyes.
When the semantic segmentation maps of faces are used, the accuracy of the segmentation can significantly affect the restoration performance. Moreover, it is very difficult to obtain accurate semantic segmentation maps from blurry input images. Zhang et al. [38] employed focal loss [39] to fine-tune a face parsing network to acquire a more accurate facial structure from a blurry image. They also proposed a separate normalization and adaptive denormalization block and a texture extractor to enhance the texture and facial details of the deblurred image. Lee et al. [34] proposed a method for learning facial component restoration without performing any segmentation. The proposed multi-semantic progressive learning (MSPL) framework is based entirely on the GAN structure. It introduces semantic coarse-to-fine progressive learning into the face deblurring task for the first time; in general image deblurring, progressive learning is mainly based on multi-scale network frameworks. In this model, the generator gradually generates face components in the order of skin, hair, interior (eyes, nose, and mouth), and, finally, the entire face. The multi-semantic discriminator of the model processes the multiple outputs of the generator, which allows the recovery of more realistic facial components at all intermediate layers.
In addition to face parsing maps, Lin et al. [25] used face sketches as priors to guide the blur kernel estimation process. Sketches of human faces can model global relationships and, used as prior information, achieve better results than extracting local sharp edges. The authors used a CNN to generate face sketches and used the learned sketches to guide the motion blur estimation module. Moreover, the generated sketches are also fed into the de-artifact subnetwork to encourage image reconstructions that preserve details and edges.
Ren et al. [35] used 3D face priors [40] as guidance to achieve facial video deblurring. Their network model is illustrated in Fig. 8. A textured 3D face is first generated for the center frame of the video using a 3D face reconstruction network, which provides image-level (e.g., intensity with sharp edges) and perceptual-level (e.g., identity) information. The face deblurring network then applies the rendered, pose-aligned facial image as a guide to recover sharp faces. Furthermore, to encourage the generation of identity-related facial details during the deblurring process, the network embeds the identity vectors extracted by the 3D reconstruction network into the deblurring network branch. Benefiting from the clear face structure estimated by 3D face rendering, the model can achieve good results without using a multi-scale structure or tricks such as perceptual loss.
Fig. 8 Deblurring neural network using a 3D human face as a prior [35]. The top green block shows how to reconstruct a 3D face by regressing the 3D morphable model [41] coefficients and render a sharp facial image. The bottom orange block focuses on face deblurring guided by the extracted identity vector and the rendered sharp face structure from the 3D face reconstruction branch.
In 2017, Xu et al. [42] proposed introducing a perceptual loss into a face deblurring network to guide its training. The perceptual loss measures the error between the deep features of two images. Its framework is shown in Fig. 9. Unlike Xu et al., Jung et al. [37] guided the deblurring process by using the deep features of the image as a separate stream in the network. Their network architecture is shown in Fig. 10. Both face geometry and texture information are included in the deep features. In addition, a channel attention feature discriminator was proposed to assign different importance levels to different channels of the deep features, thereby enabling the generator to focus on the more important channels.
Fig. 9 Deblurring neural network using deep features as the network loss [37]. "GT" indicates "ground truth".
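The idea of a perceptual loss, comparing two images in a feature space rather than pixel by pixel, can be sketched as follows. In practice the features come from a pretrained deep network; here a toy stand-in (a bank of fixed random 3×3 filters followed by a ReLU) is used so that the example stays self-contained. The stand-in extractor, the mean squared error between features, and all names are assumptions for illustration only.

```python
import numpy as np

def toy_features(img, filters):
    """Stand-in feature extractor: valid 2D correlation with fixed filters + ReLU."""
    fh, fw = filters.shape[1:]
    # windows has shape (H - fh + 1, W - fw + 1, fh, fw)
    windows = np.lib.stride_tricks.sliding_window_view(img, (fh, fw))
    feats = np.einsum("ijkl,fkl->fij", windows, filters)
    return np.maximum(feats, 0.0)

def perceptual_loss(output, target, feature_fn):
    """Mean squared error between the feature maps of two images."""
    return float(np.mean((feature_fn(output) - feature_fn(target)) ** 2))

# Toy usage with a fixed random filter bank standing in for a pretrained network.
rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, 3))
feature_fn = lambda im: toy_features(im, filters)
sharp = rng.random((16, 16))
restored = rng.random((16, 16))
loss_same = perceptual_loss(sharp, sharp, feature_fn)
loss_diff = perceptual_loss(restored, sharp, feature_fn)
```

Because the loss is computed on feature maps, two images that differ slightly per pixel but share the same structure can score better than two images with a small pixel error but different edges, which is the motivation for using it alongside a pixel-wise loss.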
In 2021, Wang et al. [5] leveraged diverse generative facial priors for blind face restoration. This method also uses the deep features of images as prior information; the difference is that it uses StyleGAN2 [43] to extract the deep features. The input image is mapped to the closest latent code using a multilayer perceptron. The latent code is fed into the StyleGAN2 network, and its intermediate features are extracted as generative prior information, which is modulated to guide the network toward more realistic results. Zhu et al. [36] proposed an adaptive feature fusion block to combine the shape feature priors and deep feature priors of images. The shape feature prior of the image is acquired using a 3D face reconstruction module, and the deep feature prior is acquired using StyleGAN2, similarly to [5]. However, these two methods target the combined degradation of facial images, including blur, noise, and JPEG compression, and are not designed specifically for deblurring.
In conclusion, end-to-end deblurring networks based on deep learning outperform traditional methods that require iterative optimization in terms of speed. Properly introducing semantic information into the network improves its deblurring accuracy. A common disadvantage of these methods is that they are only suitable for processing uniform motion blur and aligned facial images. These methods may fail for nonuniform blur or profile face images.

Unsupervised learning
On the one hand, the performance of deep neural networks strongly depends on a large number of paired training datasets; however, it is difficult to construct such paired training datasets. On the other hand, most of the current methods operate on synthetic training datasets. The learned network does not generalize well to real-world blurry images because of the large differences in the data distribution between different domains. To overcome these difficulties and limitations, some researchers focus on unsupervised methods for facial image deblurring.
In 2018, Madam et al. [44] proposed a GAN-based method for unsupervised image deblurring. They added a reblur loss and a multi-scale gradient loss to the model. Although they achieved good performance on synthetic datasets, their results for some real blurry images were not satisfactory. In 2019, Lu et al. [45] proposed an unsupervised facial image deblurring method based on disentangled representation. The model separates the content features and blur features of the blurry images and regularizes the distribution of the extracted blur attributes by enforcing a Kullback-Leibler (KL) divergence loss, thereby decoupling the two features. As in CycleGAN [46], an adversarial loss and a cycle-consistency loss are used as regularizers to help the generator network generate more realistic images while preserving the content of the original images. A perceptual loss is also added to remove artifacts in the deblurred images. Its model architecture is shown in Fig. 11.
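Two of the regularizers named above have simple closed forms, sketched below. The KL term assumes that the blur encoder outputs the mean and log-variance of a diagonal Gaussian and that the target distribution is a standard normal, which is the common choice in disentanglement models but is not spelled out in the text; the cycle-consistency term is written as a mean L1 distance. Function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def cycle_consistency_loss(x, x_cycled):
    """Mean L1 distance between an image and its round-trip reconstruction."""
    return float(np.mean(np.abs(x - x_cycled)))

# Toy usage: a 2D blur code with unit variance and mean (1, 0).
kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
cyc = cycle_consistency_loss(np.ones((4, 4)), np.full((4, 4), 0.75))
```

The KL term is zero only when the encoded blur code already matches the standard normal, so minimizing it discourages the blur branch from carrying content information.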
In the same year, Xia and Chakrabarti [47] proposed a data-based unsupervised framework for training image estimation networks. This framework can be applied to a general class of observation models in which the measurements are linear functions of the real images with additive noise. The authors provided solutions for both blind and nonblind restoration, which can be used for face deblurring. In blind image restoration, a parameter estimation network and an image estimator are trained separately, and the network is guided to learn image deblurring from unlabeled data by defining a "swap-measurement" loss and a "self-measurement" loss. The training of this framework does not require paired training data, but it does require two blurry images of the same scene with different blur types.
Unsupervised learning methods lack corresponding ground-truth images as supervision; therefore, the generated images tend to be of lower quality. However, these methods allow training on easily generated unpaired data as well as on wild images that represent real blur; therefore, developing unsupervised learning methods is of great significance.

Datasets used for facial image deblurring
Depending on how they are constructed, datasets can be divided into real-shot datasets and synthetic datasets. For real-shot datasets, blurry images are usually obtained by averaging video frames or by moving the camera along a specific trajectory. For facial image deblurring, most methods are trained on synthetic blurry-sharp image pairs.

Real-shot datasets
2MF² dataset [28]: This dataset was created by processing and extracting faces from videos containing facial images. It consists of 1150 videos containing 2.1 million frames of acceptable facial images, and landmark localization is implemented for each image. The authors later updated the dataset [27], expanding the number of videos and face identities in the images. The augmented dataset contains people of different ages and ethnicities and includes 19 million frames from 11,590 videos of 850 different identities, each of which appears in multiple videos. Moreover, the authors generated motion-blurred images by averaging multiple frames of the same face.
Lai et al. dataset [49]: Lai et al. collected real-world blurry images obtained in the wild. These images were captured by different users using different cameras and settings. There are 100 real blurry images, including natural, facial, and text images, none of which has a corresponding ground truth. Various face deblurring models and methods have been tested and compared on this dataset to evaluate their performance on real blurry images.

Synthetic datasets
Shen et al. dataset [31]: Shen et al. collected 2000, 2164, and 2300 clear facial images as ground truth from the Helen dataset [50], CMU PIE dataset [51], and CelebA dataset [52], respectively. Then, 20,000 motion blur kernels were generated according to the method in [3], with sizes ranging from 13×13 to 27×27. The corresponding blurry images were synthesized by convolving the clear images with these blur kernels and adding Gaussian noise. In total, approximately 130 million blurry-sharp training pairs were generated (6464 images × 20,000 kernels). The authors also constructed a test set containing 16,000 images. It was generated by collecting 100 facial images from each of the validation sets of the Helen and CelebA datasets and convolving them with another 80 generated motion blur kernels.
MSPL dataset [34]: Lee et al. used 30,000 HR facial images from the CelebA-HQ dataset [53] as clear ground truth and generated 18,000 motion blur kernels according to the method in [2]. They convolved each sharp image with a kernel and added Gaussian noise. The generated images were split into two subsets: 24,183 image pairs for training and 5817 image pairs for validation. The authors also collected 240 clear images (80 from each dataset) from the CelebA, CelebA-HQ, and FFHQ [54] datasets and convolved them with 240 synthetic motion blur kernels for model testing. Two types of test data were generated: MSPL-Center and MSPL-Random. MSPL-Random was generated by random rotation, cropping, and horizontal flipping of MSPL-Center.
Jin et al. dataset [23]: Jin et al. collected and cropped 110,000 clear facial images from FaceScrub [55] and generated 10,000 random motion blur kernels according to the method in [2]. Finally, white Gaussian noise was added to generate the training data.
Lin et al. dataset [25]: Lin et al. collected 2184, 2000, 2000, and 2400 clear facial images from the CMU PIE, Helen, CelebA, and PubFig [56] datasets, respectively. These clear images were convolved with 20,000 motion blur kernels to generate blurry images. The blur kernels varied in size from 21×21 to 51×51. To increase the diversity of the data, the authors applied data augmentation, including random rotation, cropping, and scaling.
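The synthesis recipe shared by these datasets (convolve a sharp image with a blur kernel, then add Gaussian noise) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the pipeline of any cited paper: the function names, the noise level, and the straight-line kernel are assumptions, and the line kernel is only a toy stand-in for the random motion kernels generated by the methods in [2,3].

```python
import random


def convolve2d_same(image, kernel):
    """Convolve a 2D grayscale image (list of rows) with an odd-sized kernel.

    The output has the same size as the input; pixels outside the image
    are taken from the nearest edge pixel (edge replication).
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Index flip (y + ph - i) makes this a true convolution.
                    yy = min(max(y + ph - i, 0), h - 1)
                    xx = min(max(x + pw - j, 0), w - 1)
                    acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def horizontal_motion_kernel(size):
    """Toy motion kernel (assumption): a normalized horizontal line."""
    kernel = [[0.0] * size for _ in range(size)]
    for j in range(size):
        kernel[size // 2][j] = 1.0 / size
    return kernel


def synthesize_blurry(sharp, kernel, noise_sigma=0.01, seed=0):
    """Blur a sharp image in [0, 1] and add zero-mean Gaussian noise."""
    rng = random.Random(seed)
    blurred = convolve2d_same(sharp, kernel)
    return [
        [min(max(v + rng.gauss(0.0, noise_sigma), 0.0), 1.0) for v in row]
        for row in blurred
    ]
```

Calling `synthesize_blurry(sharp, horizontal_motion_kernel(13))` on a sharp image returns one blurry counterpart; repeating this over many images and kernels yields the blurry-sharp pairs described above.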
Many studies selected different source datasets and blur kernel generation methods to synthesize their own image pairs for training or testing, according to their specific application scenarios; these datasets are summarized in Table 2. Some of them are not publicly available and were used only to train the authors' own models.

Table 2
Dataset | Source of sharp images | Blur kernel generation
2MF² [28] | None | None
Lai et al. [49] | None | None
Shen et al. [31] | Helen, CMU PIE, and CelebA | [3]
MSPL [34] | CelebA-HQ | [2]
Jin et al. [23] | FaceScrub | [2]
Lin et al. [25] | CMU PIE, Helen, CelebA, and PubFig | [3]
Madam [44], Lu [45], and Qi [30] | CelebA | [2]
Wang et al. [5] | FFHQ | Gaussian blur kernel
Xia [47] and Yasarla [32] | Helen and CelebA | [2]
Song et al. [4] | Multi-PIE [57] and PubFig | Gaussian blur kernel

Loss function
When a network is trained, the choice of loss function is critical to the final deblurring performance. In general deblurring models, content loss, perceptual loss, and adversarial loss are often combined to achieve better results. For domain-specific face deblurring, local structural loss, identity preserving loss, etc., have also been proposed to help improve the performance.
Content loss. Content loss, also known as reconstruction loss, is the most classic and commonly used loss function. It measures the pixel-by-pixel difference between the output deblurred image and the ground-truth image, usually with the L1 distance (mean absolute error) or the L2 distance (mean squared error). It can be written as
$$L_{c} = \left\| G(I_B) - I_S \right\|_p, \quad p \in \{1, 2\},$$
where $I_B$ is the blurry input, $G(I_B)$ is the deblurred output of the network $G$, and $I_S$ is the ground-truth sharp image; that is, the pixel values of the deblurred output image should be as close to the ground truth as possible. However, this pixel-by-pixel comparison is relatively simple and crude, and it does not consider human subjective visual perception. Adopting content loss tends to lead to oversmoothed results. Some methods [27,44] also incorporate a first-order gradient distance between the deblurred image and the ground-truth image in the content loss to remove undesired ringing artifacts at image boundaries.
Perceptual loss. A visually pleasing result cannot be achieved using content loss alone. Perceptual loss has been widely used in applications such as style transfer and image super-resolution to improve perceptual quality. It compares the deblurred and sharp images in a high-dimensional feature space to measure their feature similarity. Perceptual loss can be written as
$$L_{p} = \sum_{l} \left\| \phi_l(G(I_B)) - \phi_l(I_S) \right\|_2^2,$$
where $\phi_l$ represents the output of layer $l$ of a pretrained network, $G(I_B)$ is the deblurred output for the blurry input $I_B$, and $I_S$ is the ground-truth sharp image. The authors in [4,25,32–34] used the outputs of the pool2 and pool5 layers of the pretrained VGGFace network, whereas those in [5,30,37] used the outputs of conv3_3 and other layers of the pretrained VGG-19 network as features. The low-level features in the pretrained network contain edge, color, brightness, and other information, whereas the high-level features contain texture and rich semantic information. When perceptual loss is used in the network, the details and texture of the output image can be enhanced. For example, the edges in the image become sharper, which compensates for the oversmoothed results caused by content loss.
Adversarial loss. To make the output facial images as realistic and as close to natural images as possible, researchers have developed various GAN-based methods. GAN-based networks define the problem as a min-max optimization process to make the output image of the generator look closer to a real face. The training objectives of the generator $G$ and the discriminator $D$ are opposite, and realistic images are generated through their competition. Adversarial loss is calculated as follows:
$$\min_G \max_D \; \mathbb{E}_{I_S}\left[\log D(I_S)\right] + \mathbb{E}_{I_B}\left[\log\left(1 - D(G(I_B))\right)\right],$$
where $I_B$ represents the blurry facial image and $I_S$ represents a real sharp facial image. Through the competition between the generator and the discriminator, the adversarial loss pushes the output image of the network to look as close to a real face as possible. However, training with only adversarial loss may result in the loss of image details (such as the eyes, nose, and mouth of faces). This is because the generator can produce realistic-looking images even if some details or colors are lost, and the discriminator classifies these images as real with a small adversarial loss.
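To make the opposing objectives concrete, the sketch below evaluates the two sides of the adversarial game from discriminator outputs that are assumed to be probabilities in (0, 1). The function names are hypothetical. The generator term uses the common non-saturating form, -log D(G(I_B)), rather than the literal min-max expression, which is a substitution often made in practice and not necessarily the choice of any method surveyed here.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy objective for D.

    d_real: D's probabilities on sharp images (should approach 1).
    d_fake: D's probabilities on deblurred outputs (should approach 0).
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_adversarial_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: small when D scores outputs as real."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)
```

When the discriminator cannot tell the two sets apart and outputs 0.5 everywhere, its loss equals 2 ln 2 and the generator loss equals ln 2; the generator loss falls as the discriminator assigns higher probabilities to the deblurred outputs.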
Local structural loss. Content loss supervises the image as a whole. Because the eyes, nose, mouth, and other facial components occupy only a small proportion of the entire image, content loss alone cannot restore these parts well. Many methods [5,31–33] introduced a local structural loss of the face to enhance important facial components in terms of visual perception. A common practice is to extract the face parsing map of the input image or directly perform face segmentation and then apply different importance weights to different facial components. Local structural loss is defined as follows:
$$L_{ls} = \sum_{i} \omega_i \left\| M_i \odot \left( G(I_B) - I_S \right) \right\|_p,$$
where $M_i$ is the semantic mask of the $i$-th component extracted from the facial image, $\omega_i$ is the weight of the $i$-th component, $\odot$ denotes element-wise multiplication, $G(I_B)$ is the deblurred output for the blurry input $I_B$, and $I_S$ is the ground-truth sharp image. The models typically apply local structural loss to components such as the eyes, eyebrows, teeth, lips, and nose, but not to areas such as the skin and hair. The introduction of local structural loss can force the network to reconstruct sharper details in the key components of the face.
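The effect of the masks can be seen in a small numerical sketch. The code below compares a plain L1 content loss with a masked, weighted loss on the same error. It is illustrative only: the function names, the choice of the L1 norm, and the normalization by mask area are assumptions, not the formulation of a particular paper.

```python
def content_loss(output, target):
    """Mean absolute error over all pixels of two equally sized images."""
    n = len(output) * len(output[0])
    total = sum(
        abs(o - t)
        for row_o, row_t in zip(output, target)
        for o, t in zip(row_o, row_t)
    )
    return total / n


def local_structural_loss(output, target, masks, weights):
    """Weighted sum of masked L1 errors, one term per facial component.

    masks:   dict mapping a component name to a binary mask (list of rows).
    weights: dict mapping the same names to importance weights.
    Each term is divided by the mask area (an assumed normalization).
    """
    total = 0.0
    for name, mask in masks.items():
        area = sum(sum(row) for row in mask)
        if area == 0:
            continue  # component not visible in this image
        err = sum(
            m * abs(o - t)
            for row_o, row_t, row_m in zip(output, target, mask)
            for o, t, m in zip(row_o, row_t, row_m)
        )
        total += weights[name] * err / area
    return total
```

For a 4×4 image whose only error is 0.8 at one pixel inside a two-pixel eye mask, the content loss is 0.8/16 = 0.05, whereas the masked term with unit weight is 0.8/2 = 0.4. The same mistake therefore weighs far more heavily once it is measured relative to the component it belongs to.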
Identity preserving loss. Similar to perceptual loss, identity preserving loss limits the distance between the deep features of two images. In [5], the authors adopted the facial recognition network ArcFace [58] to extract the identity features of facial images. By limiting the distance between the identity features of the deblurred image and those of the real sharp image, the network prevents distortion of the deblurring result, thereby improving the accuracy of subsequent face recognition and other tasks.
Other losses. In addition to the aforementioned losses, some specific loss functions exist to improve the performance of specific models. The unsupervised method in [44] defines a reblurring loss to address the lack of ground-truth images. Another unsupervised method, proposed in [47], addresses the lack of training image pairs by defining a "self-measurement" loss and a "swap-measurement" loss. A prior loss was introduced in [37] to ensure that the deep feature priors used to guide the network training are accurate. To extract accurate 3D face priors, Ren et al. [35] defined a rendering loss function to ensure that the 3D face reconstruction process is unaffected by blurry images.
Each loss function has a different effect on the deblurred images. In the actual training process, several loss functions must be selected and combined according to the needs of the network model to improve its performance.

Training scheme
A face deblurring model may contain multiple subnetworks, and either the individual subnetworks or the entire network can be trained with various loss functions. Moreover, the training strategy of a large network also affects the final result to a certain extent. The authors of [33] adopted a progressive training strategy. First, with the remaining networks fixed, each subnetwork is trained separately with its own partial loss function as a coarse adjustment. Then, all subnetworks, i.e., the entire model, are jointly trained by minimizing the overall loss.
In addition, to improve the model performance and enable it to handle random blur kernels in real-world blurry images, Shen et al. [31] proposed an incremental learning strategy. The authors synthesized a large number of blurry-sharp training pairs with different blur kernel sizes. Direct training on such a large amount of data can easily cause the model to fall into a locally optimal solution. The incremental learning strategy gradually increases the training data and the motion blur kernel size during the training process. The network is first trained on small blur kernels, and then the training set is progressively expanded with larger blur kernels. At each step, the network is trained on the original and new blur kernels together. Larger blur kernels are added until the training data include all the blur kernels.
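The incremental strategy amounts to a curriculum over kernel sizes. A minimal sketch of such a schedule is given below; the function name, the number of stages, and the rule of adding an equal share of sizes per stage are assumptions made for illustration, and only the 13×13 to 27×27 size range comes from the description of the Shen et al. dataset.

```python
def incremental_kernel_schedule(kernel_sizes, stages):
    """Return the cumulative set of kernel sizes used at each training stage.

    Stage 1 uses only the smallest kernels; every later stage keeps the
    earlier kernels and adds larger ones, so the last stage covers all sizes.
    """
    sizes = sorted(kernel_sizes)
    schedule = []
    for stage in range(1, stages + 1):
        # Ceiling division: how many sizes are unlocked up to this stage.
        upto = -(-len(sizes) * stage // stages)
        schedule.append(sizes[:upto])
    return schedule


# Odd kernel sizes from 13 to 27, split over four hypothetical stages.
schedule = incremental_kernel_schedule(range(13, 28, 2), stages=4)
for stage, sizes in enumerate(schedule, start=1):
    print(f"stage {stage}: train on kernel sizes {sizes}")
```

In a real training loop, each stage would sample blurry-sharp pairs only from the kernels unlocked so far before moving on to the next stage.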

Evaluation metrics
Evaluating the output image quality can help in judging the quality of a model. However, it is difficult to construct a sufficiently objective metric that conforms to human perception. In the field of face deblurring, the metrics often used for comparison can be divided into three categories, according to their levels and purposes. The first type evaluates images at the pixel level. The second type evaluates the quality of the images in terms of visual perception by extracting deep features of the images. The third type is task oriented and evaluates the quality of images by comparing their accuracy in advanced visual tasks.

Image-level evaluation metrics
Among these metrics, the most commonly used are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). These metrics do not require any additional inputs and can be obtained through simple calculations. Because they require the ground truth corresponding to each image, they are reference evaluation metrics.
PSNR: This metric is derived from the pixel-level mean squared error between two images. The larger the value, the smaller the difference between the two images. However, PSNR values are often inconsistent with the subjective perception of human vision, and optimizing for PSNR tends to produce oversmoothed results.
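For reference, PSNR is commonly computed as 10 log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below follows that standard definition for images stored as lists of rows with values in [0, 1]; the function names and the `max_val` parameter are illustrative.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized grayscale images."""
    n = len(a) * len(a[0])
    return sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    ) / n


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / err)
```

A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and hence a PSNR of 20 dB.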
SSIM [59]: SSIM is modeled on the human visual system. This metric measures the difference between two images in terms of brightness, contrast, and structure. The means, variances, and covariance of the images are used to evaluate their brightness, contrast, and structural similarity, respectively. The larger the value of SSIM, the more similar the two images. However, this metric does not always agree with human perception: images with lower SSIM values may still look good.
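The role of the mean, variance, and covariance can be made explicit with a simplified SSIM computed from global image statistics. Note the simplification: the standard metric [59] evaluates the same expression in local windows and averages the results, whereas this sketch uses a single window covering the whole image. The constants follow the usual choice C1 = (0.01·MAX)² and C2 = (0.03·MAX)², and the function name is hypothetical.

```python
def global_ssim(x, y, max_val=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from global statistics of two images."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unbiased variance and covariance estimates.
    var_x = sum((v - mean_x) ** 2 for v in xs) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in ys) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    c1 = (k1 * max_val) ** 2  # stabilizes the brightness term
    c2 = (k2 * max_val) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images score 1, and a pure brightness shift lowers the score only through the mean-dependent factor, because the variances and the covariance are unchanged.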

Perceptual evaluation metrics
These metrics mine deep features of images and often give results that are consistent with human vision. They include reference evaluation metrics, which require a ground truth, and no-reference evaluation metrics, which do not require clear images.
LPIPS [60]: Unlike the above metrics, LPIPS calculates the distance between two images in a high-dimensional feature space so that the result is as close as possible to human visual perception. The deep features are extracted using a pretrained classification network. LPIPS is a perceptual metric, and a smaller value indicates that the two images are more similar visually. The calculation formula is as follows:
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\|_2^2,$$
where $d$ is the distance between $x$ and $x_0$, $\hat{y}^{l}_{hw}$ and $\hat{y}^{l}_{0hw}$ are the unit-normalized features extracted from layer $l$ of the network at spatial position $(h, w)$, and $w_l$ scales the channels of layer $l$. That is, the L2 distance is calculated, averaged over the spatial positions, and summed over the channels and layers.
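The order of operations in this distance (normalize, scale, square, average spatially, sum over layers) is easy to follow on toy data. The sketch below assumes the per-layer feature maps have already been extracted and are given as nested lists of shape H × W × C; in the actual metric they come from a pretrained network and the channel weights are learned, neither of which is reproduced here. All names are hypothetical.

```python
import math


def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def lpips_style_distance(feats_x, feats_x0, channel_weights):
    """LPIPS-style distance between two stacks of feature maps.

    feats_x, feats_x0: lists with one H x W x C feature map per layer.
    channel_weights:   list with one length-C weight vector per layer.
    """
    total = 0.0
    for fx, f0, w in zip(feats_x, feats_x0, channel_weights):
        height, width = len(fx), len(fx[0])
        layer_sum = 0.0
        for row_x, row_0 in zip(fx, f0):
            for vec_x, vec_0 in zip(row_x, row_0):
                nx, n0 = unit_normalize(vec_x), unit_normalize(vec_0)
                # Squared L2 norm of the channel-weighted difference.
                layer_sum += sum(
                    (wc * (a - b)) ** 2 for wc, a, b in zip(w, nx, n0)
                )
        total += layer_sum / (height * width)  # spatial average
    return total
```

Identical feature stacks give a distance of zero, and two orthogonal unit features at a single position with unit weights give a distance of 2.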
NIQE [61]: This metric constructs a set of quality perception features and fits them to a multivariate Gaussian (MVG) model. The quality perception features are extracted using a natural scene statistics (NSS) model. Then, the quality of the test image is given as the distance between the MVG fit of the NSS features extracted from the test image and the MVG model of the quality perception features extracted from a corpus of natural images. The calculation formula is as follows:
$$D(\nu_1, \nu_2, \Sigma_1, \Sigma_2) = \sqrt{ (\nu_1 - \nu_2)^{T} \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\nu_1 - \nu_2) },$$
where $\nu_1$ and $\nu_2$ are the mean vectors and $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the natural MVG model and the distorted image MVG model, respectively. The smaller the value, the better the image quality.

Advanced visual task evaluation metrics
Image deblurring is a low-level task that aims to improve the accuracy of higher-level tasks, such as image recognition, object detection, and image segmentation. Therefore, for the face deblurring task, the meaningful metrics are the identity invariance of the deblurred faces and the accuracy of face recognition.
Identity distance. To measure the face identity similarity between deblurred images and ground-truth images, many methods compute their feature distance $d_{VGG}$ in pretrained face recognition networks such as VGGFace [67]. The smaller the feature distance, the more similar the face identities in the two images.
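Once the embeddings are available, the identity distance reduces to a vector distance. The sketch below assumes the two embeddings have already been produced by a face recognition network (not included here) and compares them with the Euclidean distance after L2 normalization. Both the normalization and the choice of the Euclidean distance are assumptions, as individual papers may define the distance differently, and the function name is hypothetical.

```python
import math


def identity_distance(emb_a, emb_b, eps=1e-10):
    """Euclidean distance between L2-normalized identity embeddings."""
    norm_a = math.sqrt(sum(v * v for v in emb_a)) + eps
    norm_b = math.sqrt(sum(v * v for v in emb_b)) + eps
    return math.sqrt(
        sum((a / norm_a - b / norm_b) ** 2 for a, b in zip(emb_a, emb_b))
    )
```

With this normalization the distance lies between 0 (same direction, hence the same identity according to the network) and 2 (opposite directions).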

Performance evaluation
Because model-based methods do not require training datasets, they cannot be compared directly with learning-based methods. Specifically, model-based methods assist the optimization process by designing a specific prior as a regularization term or by building a sample library, whereas learning-based methods often need to train the network on specific datasets to ensure good results on test images from the same distribution. Therefore, in this section, we compare the experimental performance of representative model-based methods and deep learning-based methods separately.

Comparison of model-based methods
For the traditional methods, we compared four classes of methods: sparse prior-based methods [71,72], class-specific prior-based methods [12], exemplar-based methods [16], and the edge selection-based general deblurring methods [73,74].
We first show the performance of these methods on synthetic datasets, where the test images were selected from the CMU PIE dataset. We chose PSNR and SSIM as the evaluation metrics; the values are taken from the respective papers and are shown in Table 3. Pan et al. [16] did not release their source code or their results for these two metrics; therefore, we only show their visual performance. To compare the visual effects, we show qualitative comparison examples in Fig. 12, which are taken from [12], as these authors did not provide a full implementation of their methods. The bottom-left corner of each recovered image shows the blur kernel estimated using the corresponding method.
The methods of Shan et al. [71] and Krishnan et al. [72] produced obvious artifacts and multiple edges (ghosting). This is because the sparse prior is a general prior and does not perform well on facial images, which contain less texture information. The methods of Cho and Lee [73] and Xu and Jia [74], which were developed for general deblurring, estimate blur kernels that are denser than the ground truth. This is because there are insufficient sharp edges in the facial image to recover an accurate blur kernel. The exemplar-based method of Pan et al. [16] suffers from severe artifacts in the skin texture. This is because it only considers the restoration of the contour of the face without considering the correctness of the internal details of the face. The method of Anwar et al. [12] learns specific priors for different classes by learning the distribution features of different types of images in the frequency space. However, the restored images are too smooth and lose the contour information of specific parts.
Table 3 (excerpt)
Method | Category | PSNR | SSIM
Cho and Lee [73] | Edge selection | 24.38 | 0.699
Xu and Jia [74] | Edge selection | 23.30 | 0.739
Anwar et al. [12] | Class-specific prior
We also show the performance of representative methods from the above four categories on real blurry images in Fig. 13. These images do not have a corresponding sharp ground truth. None of the four classes of methods performs well on real images, and all of them produce artifacts to different degrees. Furthermore, these methods do not restore sharp results effectively in areas such as the eyes and nose.

Comparison of learning-based methods
We compared methods based on semantic segmentation maps, methods based on generative priors, blind face restoration methods, and unsupervised face deblurring methods. We compared the performance of these methods on the two most commonly used datasets: the Shen et al. dataset [31] and the MSPL dataset. The Shen et al. dataset was constructed from the Helen and CelebA datasets; therefore, we compared the performance of the above methods on these datasets separately. PSNR and SSIM are the most commonly used evaluation metrics; however, they cannot fully represent realistic visual quality. Therefore, we also chose the perceptual metrics LPIPS and VGG distance. These representative methods and their quantitative comparisons are presented in Table 4. Note that all these results are taken from the respective papers.
Shen et al. [33] did not provide their model implementation or their results on perceptual metrics in their paper; therefore, we do not show them here. For pixel-level evaluation metrics, such as PSNR and SSIM, the UMSN and DFPGnet models achieved the best and second-best results, respectively. This shows that, when used properly, both semantic segmentation maps and generative priors can help the model to achieve good performance. For perceptual metrics such as $d_{VGG}$ and LPIPS, the DFPGnet and MSPL models achieved the best and second-best results, respectively. This shows that a network based on a generative prior can effectively learn the perceptual similarity between images. In addition, the unsupervised method proposed by Xia and Chakrabarti [47] achieved performance second only to that of the above methods and outperformed most other unsupervised methods. This model requires two images of the same scene with different degrees of blur as inputs to compensate for the lack of ground truth. This setting can help in the training process of the model and make it more robust. That is, adding auxiliary images to the unsupervised model is an idea that is worth trying. In contrast, unpaired learning methods [45] and multi-task blind face restoration methods [5] achieved poor results.
Fig. 14 Qualitative comparisons of the representative deep learning methods on the Shen et al. test dataset [31]. From left to right: blurry image, ground truth, Shen et al. [31], UMSN [32], MSPL [34], Wang et al. [5], and DFPG [37].
We also present qualitative comparisons on the Shen et al. dataset and the MSPL dataset in Figs. 14 and 15, which are taken from [37]. Shen et al. [33] introduced the semantic information of the face as a prior to the network for the first time and achieved a breakthrough deblurring result. However, their method produces overly smooth results, with unnatural restoration in small areas such as the eyes and nose. Models such as UMSN [32] assign different weights to face components, providing clear results even for small face components. This shows that introducing a semantic segmentation map into the model and processing different semantic regions separately can improve the performance of the model. However, the model cannot adequately remove the blur of the image background, and the generated faces are somewhat distorted in identity. The representative method DFPGnet, which takes the deep features of the face as a prior, produces deblurring results that are consistent with the identity of the original image. This shows that the deep features of the image simultaneously contain information such as identity, shape, and texture, which can guide the network to generate more realistic results. However, for fine structures such as beards, eyelashes, and hair, the models still fail to restore sharp results. As shown in Fig. 14, the blind face restoration method [5] produces severe artifacts in the test images. This is because the synthetic training datasets that the authors used only considered a simpler Gaussian blur. Therefore, their method can only handle simple and small blurs and exhibits poor results when the degradation of the image is severe. As shown in Fig. 15, even the best unsupervised methods cannot produce results comparable to those of supervised methods. Restoring sharp images without an aligned ground truth remains a difficult problem.

Discussion
In this section, we will discuss the main differences between model-based and learning-based methods.
Flexibility of the model. Model-based methods usually require manually designed priors, which limit the solution space of the problem and are very important for the final deblurring performance. In other words, a more refined prior often has to be designed for each specific requirement, which is very inflexible. In contrast, deep neural networks can fit complex and varied blur processes owing to their powerful fitting capability. Therefore, most of the approaches focus on designing more powerful network structures. However, this often requires retraining the network when there are differences in the image distribution.
Fig. 15 Qualitative comparisons of the representative deep learning methods on the MSPL test dataset [34]. From left to right: blurry image, ground truth, Shen et al. [31], Lu et al. [45], Xia and Chakrabarti [47], UMSN [32], MSPL [34], and DFPG [37].
Deblurring performance. Traditional methods usually introduce strong assumptions, such as the existence of a large amount of edge information in a scene. These assumptions result in poor generalization of the model and in ringing and artifacts on severely degraded real blurry images. Benefiting from large model capacity and large-scale datasets, deep learning methods can learn a wide range of blur patterns and produce better results than traditional methods on test images.
Inference speed. Traditional methods are based on an iterative optimization process; therefore, they take several minutes to tens of minutes to process one image. In contrast, deep learning methods require only a few milliseconds to process an image at test time.
In summary, learning-based methods outperform traditional methods in terms of flexibility, performance, and inference speed. Therefore, the current mainstream methods are all based on deep learning and focus on exploring better network architectures or better training tricks.

Limitations and future research
Deblurring of facial images is a challenging research topic. Here, we summarize the problems and limitations of existing methods and provide possible future research directions.

Model structure design
There are four main limitations to the current methods. First, most of the methods can only process spatially uniform blur. In real-world settings, many scenes have non-uniform blur, such as a person blurred by motion against a clear background, or a blurry background behind a clear foreground caused by a large camera aperture. Most of the existing methods are unable to deal with such spatially non-uniform blur; therefore, identifying the specific degree of blur in different parts of an image is a problem that needs to be addressed. Second, the current methods are effective on frontal facial images. However, for side faces, facial images with partial occlusion (such as glasses or hands covering the face), and surveillance images with multiple faces, the results are severely distorted. Third, when the motion blur is severe, current methods often fail. For severely blurry facial images, these methods can lead to issues such as displacement of the facial components. Fourth, the recognition accuracy of deblurred facial images is still inferior to that of ground-truth images.
The ultimate purpose of deblurring is to improve the accuracy of high-level visual tasks. At present, the most advanced methods achieve the same face detection accuracy on deblurred images as on clear images. However, in terms of face recognition accuracy (calculated using MobileNet trained with ArcFace loss), there is still a gap of four percentage points between state-of-the-art (SOTA) deblurred images and clear images [37].
For problems 1 and 2, we can aim to design finer and better network structures. Many deblurring methods developed for general scenarios have designed models specifically for spatially non-uniform blur [75–77], which can be used as a reference. To improve the performance on various facial poses (front or side), we need to explore how to integrate various facial priors, such as semantic segmentation maps, generative priors, and 3D priors, into the design of network structures to maximize their effects. For problems 3 and 4, we can try to increase the model capacity to improve the fitting ability.
Currently, many transformer-based structures [78,79] have been developed for image restoration tasks. With the powerful fitting ability of the transformer, the performance on severely blurry images can be improved.

Construction of datasets
Creating aligned blurry-sharp face image pairs is very difficult. According to this survey, most of the existing datasets are generated by convolving sharp images with predefined blur kernels. There are three main problems here. First, synthetic datasets cannot represent the distribution of real images in the wild. Second, most of the synthesized blur kernels are spatially invariant and cannot represent dynamic scenes with different blurs in different places. Third, some methods perform better than others only on specific datasets, and the advantage does not hold when switching to another dataset. Therefore, it is necessary to develop benchmark datasets that cover different blur types.

Unsupervised learning
Supervised deep learning methods require a large amount of paired training data to provide supervision for the model, resulting in data dependence. In contrast, unpaired learning methods can be trained on real blurry data without a ground truth, thereby improving the reconstruction ability of the model on real blurry facial images. Therefore, it is necessary to develop unsupervised methods for face deblurring.
Existing unsupervised methods are mainly based on GANs. Because there are no ground-truth images for training, it is necessary to design proper loss functions to overcome this disadvantage. Moreover, these methods do not fully utilize the semantic or structural information of faces for guidance. When there is no corresponding ground-truth image, the extracted semantic information tends to be inaccurate. Therefore, extracting correct semantic information from blurry images and incorporating it into the training of the network is a potential direction for unsupervised learning.

Model generalizability
The performance improvement of deep learning methods mainly relies on large-capacity models and large datasets. Large-capacity models are extremely computationally expensive during training, and their training results are overly dependent on the training dataset. There are gaps in the distribution of data for people with different skin colors, ages, and ethnicities. Owing to this gap between domains, deep learning models exhibit poor generalization performance across different datasets and are prone to artifacts and other phenomena. Therefore, methods for domain adaptation should be developed and implemented to address this problem.

Computational cost
When deeper networks are used to improve performance, the number of model parameters and the computational complexity also increase. It is difficult to deploy large-capacity models on mobile phones or on the embedded devices of monitoring systems. Therefore, the development of a lighter face deblurring model that combines and utilizes the characteristics of the device, such as a dual camera [80], is a potential future direction.

Fig. 1 Different types of blurry facial images.

Fig. 2 Taxonomy of the facial image deblurring methods studied in our paper. "DL" indicates "deep learning". Details of these methods can be found in Sections 2 and 3.

Fig. 3 Facial structures extracted from blurry input images by Pan et al. [16]. (a) Blurry facial image to be restored; (b) matched example facial image; (c) corresponding edge of the facial structure.

Fig. 10 Deblurring neural network using the deep features of images as a network prior [37]. The deep features are extracted by the prior estimation stream. "SFT" indicates spatial feature transform module and "SSFT" indicates self-spatial feature transform module.

Fig. 11 Unpaired method for facial image deblurring [45]. The overall framework follows CycleGAN and includes a blurring branch (top row) and a deblurring branch (bottom row). $E^c_B$ and $E^c_S$ are the content encoders for the blurry and sharp images, respectively; $E^b$ is the blur encoder; and $G_B$ and $G_S$ are the blurry and sharp image generators, respectively.

Table 1 Overview of the single facial image deblurring methods.

Table 2 Overview of the different datasets constructed by different authors for use in training or testing.

Table 3 Evaluation results of the performance of the model-based representative methods on the CMU PIE dataset. The best results are highlighted in bold.