1 Introduction

Face Super-Resolution (SR) is an important subset of image super-resolution technology for public security. Face SR processes Low-Resolution (LR) face images, which are often acquired by low-quality surveillance cameras, to estimate the corresponding High-Resolution (HR) face images.

Due to environmental constraints, the faces acquired by surveillance cameras are often unclear. One remedy is to upgrade the imaging system to a more expensive, higher-resolution one [1]. However, this is cumbersome and costly, and it cannot resolve the issue of small face images captured far away from the camera. Therefore, researchers have proposed SR algorithms to enhance image quality, and SR is now widely used in these scenarios [1,2,3,4,5,6].

The intent of SR is to infer a priori information from LR images to obtain HR images with clearer details. In single-image face super-resolution, only one LR face image can be utilized to reconstruct the desired HR face image. Since the desired HR face image has more pixels than the LR face image, this is an ill-posed inverse problem. The traditional solution is to apply constraints based on facial features to the HR estimation process. These techniques can be broadly classified into three categories: interpolation-, reconstruction-, and learning-based methods [3]. Interpolation-based methods [7] up-sample a given LR image and impose smoothing constraints when interpolating the missing information in the HR image; they are simple to implement, but the reconstructed image is blurry. Reconstruction-based methods add a priori knowledge that constrains the down-sampling of the estimated HR image to regenerate the original LR image [7]. Learning-based methods map LR to HR images by learning the relationship between LR images and their corresponding HR images. With the development of deep learning, the performance of learning-based methods has gradually surpassed all other SR methods.
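As a concrete illustration of the interpolation-based baseline, bicubic up-sampling can be written in a few lines; the sketch below assumes OpenCV, and the file names and 4x scale are illustrative:

```python
# Minimal interpolation-based SR baseline (bicubic up-sampling).
import cv2

lr = cv2.imread("lr_face.png")            # hypothetical LR input, e.g. 32x32
scale = 4
hr_est = cv2.resize(lr, (lr.shape[1] * scale, lr.shape[0] * scale),
                    interpolation=cv2.INTER_CUBIC)
cv2.imwrite("sr_bicubic.png", hr_est)     # simple to compute, but blurry
```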

In recent years, deep-learning-based SR algorithms have attracted tremendous attention in the super-resolution research community. Dong et al. [8] combined image super-resolution with deep learning and designed a Convolutional Neural Network (CNN) with only three convolutional layers. Compared with the traditional super-resolution method based on sparse coding [9], the method in [8] greatly improved performance; however, the SR image loses significant information due to the shallow network structure. Kim et al. [10] observed that a low-resolution image and its corresponding high-resolution image are similar, i.e., the low-frequency information of the LR image is similar to that of the HR image. Therefore, if the residual of the high-frequency information between the high-resolution and low-resolution images can be accurately predicted [11], it is possible to obtain a high-quality SR image while reducing the computational burden. However, it is difficult to find a satisfactory threshold for the best SR effect because of the gradient-clipping strategy used during training. Most early deep-learning SR methods first interpolated the low-resolution image to the high-resolution size before feeding it into the neural network, which incurs a higher computational cost. Lai et al. [12] designed a multi-resolution CNN that performs 2x up-sampling at each stage, predicting the HR image step by step and thus reducing computation time. Ledig et al. [13] argued that although using MSE as the training loss can yield a high peak signal-to-noise ratio, the predicted images usually lose high-frequency details; [13] therefore uses perceptual and adversarial losses to improve the realism of the predicted image. Zhang et al. [14] proposed a channel attention mechanism based on deep residual networks, adaptively learning channel features by considering the interdependence between channels and thereby improving the perceptual quality of predicted images. Reference [15] designed a two-step deep network structure: a coarse network (with a coarse loss) and a refinement network (with a refinement loss and a GAN loss); in addition, an attention mechanism gives higher weight to features similar to the missing parts during inpainting. Reference [16] introduces non-local operations into an end-to-end neural network to capture the correlation between features and their neighbors, and shows that limiting the range of neighboring features is important when computing feature similarity; the use of an RNN also improves parameter efficiency and model robustness. It is worth noting that CNN-based super-resolution methods can achieve good performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), while the output images are often over-smoothed, resulting in poorer perceptual quality.
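The residual-learning idea in [10, 11] reduces to predicting only the missing high-frequency component on top of an interpolated input. A minimal sketch, where `cnn` stands for any trained residual predictor (an assumption):

```python
import torch

def residual_sr(cnn, lr_upscaled):
    """lr_upscaled: bicubic-interpolated LR image as an (N, C, H, W) tensor.
    The network only learns the high-frequency residual; the low-frequency
    base comes for free from the interpolation."""
    return lr_upscaled + cnn(lr_upscaled)
```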

Since the Wavelet Transform (WT) can perform multi-frequency analysis and preserves image edges well, wavelet-based SR methods are often used in image processing [17]. A wavelet-based method performs a wavelet transform on the high-resolution image to obtain wavelet sub-band coefficients, and features extracted from the LR image are mapped to the different sub-band coefficients. Generally, the features extracted from the LR image are used to predict the wavelet coefficients of the unknown HR image, and the desired HR image is then reconstructed from the predicted coefficients. The task of a wavelet-based method is thus to estimate the unknown high-frequency coefficients. The traditional solution is to learn the scale dependence between low-frequency and high-frequency coefficients, and to apply the learned mapping function to estimate the unknown detail coefficients [18, 19]. Accurately recovering the high-frequency coefficients missing from LR images is still a challenging problem. With the further development of deep learning, many methods based on estimating the high-frequency coefficients have been proposed [20,21,22,23,24,25].

In [22], a deep neural network model that combines the wavelet transform and a CNN is proposed to predict the missing details in the wavelet coefficients of low-resolution images. Zhong et al. [24] pointed out that CNN-based methods suffer a sharp performance degradation on extremely low-resolution super-resolution tasks and that the output SR image is over-smoothed; they proposed decomposing the HR image into different sub-band coefficients with the wavelet transform and using a CNN to predict the HR coefficients from the LR image to infer the SR image. Huang et al. [25] introduced Generative Adversarial Networks (GAN) [26] into the wavelet-based deep SR network: the wavelet coefficients predicted by the CNN and the wavelet coefficients decomposed from the ground truth are trained adversarially, and the resulting SR image appears more realistic.

Although wavelet-based deep learning methods reconstruct detailed textures better than CNN-based methods, they lack the translation invariance property because they rely on the structural information of the image for super-resolution. To overcome this problem, we propose a new deep SR network, namely the wavelet-based face mask super-resolution network (MWSR). Our contributions are as follows: (1) A pre-trained segmentation network is used to obtain a facial mask by detecting the facial features of the human face, and data augmentation is then performed; the purpose of this first phase is to focus attention on the facial features and to endow the wavelet-based CNN with translation invariance. (2) We introduce the method of [27] to supplement the image edges with information extracted by the Canny edge detection operator. (3) We introduce a linear low-rank convolution operation in the feature embedding stage, which improves the accuracy of the predicted wavelet coefficients without increasing the computational cost.

2 Discrete wavelet transform

To improve the performance of wavelet-based SR methods, the relationship between the high-frequency wavelet coefficients and the LR image was investigated. Reference [25] verified experimentally that the high-frequency wavelet coefficients of an image gradually decrease with increasing blurriness; whether these coefficients can be restored determines whether the obtained SR image is clear. CNN-based SR methods degrade sharply on very low-resolution images mainly because of the loss of high-frequency information. To reconstruct the high-frequency details of the image, this paper combines the discrete wavelet transform with a deep convolutional neural network to obtain a better SR image.

In this paper, the Haar transform [28] is utilized to decompose the two-dimensional image signal into four sub-bands using low-pass and high-pass filters. The high-pass filter is applied horizontally, vertically, and diagonally; the flow of the two-dimensional discrete wavelet transform is presented in Fig. 1. This paper first generates the detail coefficients of the image by performing the two-dimensional wavelet transform of the input image.

An example of a face image and its wavelet coefficients after the two-dimensional discrete wavelet transform is presented in Fig. 2. The example face image is from the CelebA dataset [29]. The right part of Fig. 2 shows the transformed domain, where the two-dimensional discrete wavelet transform captures the image details in the four sub-bands.

The SR task can be considered as the problem of inferring an HR image containing detail information from an LR image in which that detail information is missing. Wavelet decomposition provides an elegant structure to separate the LR information from the details, as shown in Fig. 1, where D, H, and V represent the detail information.

In our proposed method, the LR image is fed into an attention-based deep SR network to predict the D, H, V, and A sub-bands of the corresponding HR image under the DWT. The general DWT decomposition is shown in Fig. 1. After all DWT sub-bands are predicted, the computed sub-band blocks are used to recover the SR image by the two-dimensional inverse discrete wavelet transform, as shown in Fig. 2.
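A minimal sketch of this decompose/predict/reconstruct cycle using the PyWavelets library (the random array below is a stand-in for a real 128x128 face image):

```python
import numpy as np
import pywt

img = np.random.rand(128, 128)            # stand-in for an HR face image
A, (H, V, D) = pywt.dwt2(img, "haar")     # approximation + H/V/D detail sub-bands
assert A.shape == (64, 64)                # each sub-band is half the input size

# In MWSR, a network predicts A, H, V, D of the unknown HR image from the LR
# input; the SR image is then recovered with a single inverse transform:
recon = pywt.idwt2((A, (H, V, D)), "haar")
assert np.allclose(recon, img)            # Haar DWT is perfectly invertible
```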

Fig. 1
figure 1

2D discrete wavelet transform (DWT)

Fig. 2
figure 2

Image super-resolution based on the 2D DWT and the 2D inverse discrete wavelet transform (IDWT)

3 Face mask wavelet-based super-resolution network

3.1 Architecture

As shown in Fig. 3, MWSR consists of three sub-networks: a mask generator network, an attention-based feature embedding network, and a wavelet coefficient prediction network.

We use a pre-trained semantic segmentation network to generate facial mask images during training and remove it from MWSR at test time. Because spatial information is lost as the image propagates through a CNN, this paper introduces a linear low-rank convolution operation in the feature embedding stage of the SR network without incurring an additional computational burden in the convolution layers. A skip connection is applied in the wavelet coefficient prediction phase of the network, which greatly reduces the training needed to learn the low-frequency wavelet coefficients. Finally, the two-dimensional inverse discrete wavelet transform reconstructs the high-resolution image from the predicted wavelet coefficients.

Fig. 3
figure 3

The architecture of MWSR

Fig. 4
figure 4

The backbone of MWSR, which supplements the implementation details of the feature embedding and wavelet prediction stages shown in Fig. 3

3.2 Facial mask

The visual attention mechanism is a characteristic of the human visual perception system. Human vision acquires the target area of focus by quickly scanning the global image; this targeted area is generally called the focus of attention. Once the targeted area is determined, more attention is paid to it to obtain more details, while other secondary or useless information is ignored.

Inspired by the visual attention mechanism, attention mechanisms in deep learning aim to select, from a multitude of information, the information most critical to the current task. Logically, the most recognizable positions in a face image are the facial features, and most details of SR images inferred by CNN-based methods are lost [8, 12, 22, 30, 31]. Therefore, this paper designs a facial mask method that encourages more attention to the facial features while learning the mapping between the LR and HR face images. This method manually selects a priori information in the face image so as to give more attention to the facial features.

The mask is realized by detecting the facial features with the pre-trained segmentation network [31] and generating a corresponding mask image that is trained together with the original face image. A CNN can be seen as an approximator that fits the mapping between input and target; when the distribution of the training data is less complex, the accuracy of the CNN prediction is higher. It is worth noting that small-angle rotation and translation of the generated face mask images can overcome the adverse effects of the wavelet-based method, which is inherently not translation invariant. Some examples of facial masks are shown in Fig. 5: the first column on the left contains the original face images, and the three columns on the right contain the mask images generated by the mask generator network.
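A hedged sketch of this mask generation and augmentation step is given below; the paper does not specify the segmentation model's interface or label set, so `seg_net`, the facial-feature label ids, and the augmentation ranges are all assumptions:

```python
import torch
import torchvision.transforms.functional as TF

FACIAL_FEATURE_IDS = [2, 3, 4, 5]   # hypothetical ids for eyes, nose, mouth, brows

def facial_mask(face, seg_net):
    """face: (1, 3, H, W) tensor; returns a (1, H, W) binary facial-feature mask."""
    with torch.no_grad():
        labels = seg_net(face).argmax(dim=1)          # (1, H, W) class map
    mask = torch.zeros_like(labels, dtype=torch.float32)
    for cid in FACIAL_FEATURE_IDS:
        mask[labels == cid] = 1.0
    return mask

def augment(face, mask, max_deg=5, max_shift=4):
    """Small random rotation/translation applied jointly to image and mask,
    compensating for the wavelet pipeline's lack of translation invariance."""
    deg = float(torch.empty(1).uniform_(-max_deg, max_deg))
    tx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    ty = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    face = TF.affine(face, angle=deg, translate=[tx, ty], scale=1.0, shear=[0.0])
    mask = TF.affine(mask.unsqueeze(1), angle=deg, translate=[tx, ty],
                     scale=1.0, shear=[0.0]).squeeze(1)
    return face, mask
```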

Fig. 5
figure 5

Examples of face mask image

3.3 Canny edge detector

The edges of images inferred by CNN-based SR methods are blurry because of the loss of information during forward propagation. To supplement the image edges, we use the Canny operator to extract the edge features of the face image and use them in a loss function during training. Experiments verify that this restores more details of the facial image. The image edge loss function is defined as:

$$\begin{aligned} l_{edge}=\left\| C\left( {\tilde{I}}_{i}\right) -C\left( I_{i}\right) \right\| _{2}^{2} \end{aligned}$$
(1)

where \(I_{i}\) refers to the i-th input image, \({\tilde{I}}_{i}\) refers to the prediction of \(I_{i}\), and \(C(\cdot )\) represents the Canny edge detector.
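For reference, the edge term of Eq. (1) can be computed with OpenCV's Canny detector as sketched below. The thresholds are illustrative, and since Canny is not differentiable, this sketch evaluates the term on images converted back to 8-bit:

```python
import cv2
import numpy as np

def edge_loss(sr_img, hr_img, lo=100, hi=200):
    """sr_img, hr_img: HxWx3 uint8 arrays; returns the squared L2 edge error."""
    g_sr = cv2.cvtColor(sr_img, cv2.COLOR_BGR2GRAY)
    g_hr = cv2.cvtColor(hr_img, cv2.COLOR_BGR2GRAY)
    e_sr = cv2.Canny(g_sr, lo, hi).astype(np.float32) / 255.0
    e_hr = cv2.Canny(g_hr, lo, hi).astype(np.float32) / 255.0
    return float(np.sum((e_sr - e_hr) ** 2))
```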

3.4 Linear low-rank convolution

Earlier CNN-based SR networks [8, 10, 11] are inferior to deep residual CNN-based SR networks with more convolutional layers [12, 14, 30]. To enhance the performance of an SR network, its depth can be increased by stacking convolutional layers.

However, when the depth of the SR network is increased beyond a certain extent, information propagation between the convolutional layers is hindered. Researchers often use residual learning to overcome this problem, and residual connections can combine image texture and semantic features to generate better-quality representations [12, 13, 14, 32, 33].

Due to the truncation effect of the Rectified Linear Unit (ReLU) on CNN activations, some information is lost when the information flow is transmitted through the CNN [34]. To overcome this problem, the number of filter channels is usually increased, at the expense of higher computational cost.

The linear low-rank convolution operation is proposed to reduce this complexity while preserving the information flow through the network. The architecture is shown in Figs. 6 and 7. The linear low-rank convolution operation greatly alleviates the information loss caused by the truncation effect of ReLU.
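Since the paper describes the block only through Figs. 6 and 7, the PyTorch sketch below is one plausible reading rather than the exact design: the k x k convolution is factorized through a reduced-rank channel bottleneck with no ReLU between the two stages, so the branch itself stays linear (no truncation) while using fewer parameters than a full-rank convolution (roughly 10k vs. 37k weights for 64 channels, rank 16, k = 3):

```python
import torch
import torch.nn as nn

class LowRankConv(nn.Module):
    """Hypothetical linear low-rank convolution block."""
    def __init__(self, channels=64, rank=16, k=3):
        super().__init__()
        self.reduce = nn.Conv2d(channels, rank, k, padding=k // 2)  # C -> r
        self.expand = nn.Conv2d(rank, channels, 1)                  # r -> C, linear
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The low-rank branch itself is linear; ReLU is applied only after
        # the residual addition, so no information is truncated inside it.
        return self.act(x + self.expand(self.reduce(x)))

blk = LowRankConv()
y = blk(torch.randn(1, 64, 32, 32))       # output shape: (1, 64, 32, 32)
```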

Fig. 6
figure 6

Original convolution block

Fig. 7
figure 7

Low-rank convolution block

3.5 Loss function

In this paper, three types of loss functions are used: an image edge loss, a wavelet-based loss, and an image pixel-based loss. We use the Mean Square Error (MSE) to compute the image pixel-based loss of the SR network [17, 35,36,37,38]. In [22, 25], the wavelet-coefficient-based loss function is defined as the weighted sum of a sub-band coefficient loss and an image texture-based loss.

MSE is also commonly used in the wavelet loss function to compute the errors of the low- and high-frequency sub-band coefficients. However, it is observed in [39,40,41] that a direct MSE loss cannot achieve perceptually good SR images because of the characteristic distribution of the high-frequency sub-band coefficients of face images. Thus, we redefine the loss function of the high-frequency sub-band coefficients as follows:

$$\begin{aligned} l_{HF}\left( {\bar{o}}_{i}, o_{i}\right) =\sum _{i}^{N} p\left( {\bar{o}}_{i}, o_{i}\right) \log \left( \frac{p\left( o_{i}\right) }{q\left( {\bar{o}}_{i}\right) }\right) \end{aligned}$$
(2)

where \(o_{i}\) and \({\bar{o}}_{i}\) refer to the ground-truth high-frequency sub-band coefficients obtained by wavelet decomposition and the predicted high-frequency sub-band coefficients of the SR image, respectively. \(p\left( o_{i}\right) \) denotes the characteristic distribution of \(o_{i}\), and \(q\left( {\bar{o}}_{i}\right) \) is the characteristic distribution of \({\bar{o}}_{i}\). The low-frequency sub-band coefficients use MSE as the loss function, so the wavelet-coefficient-based loss function in this paper is defined as follows:

$$\begin{aligned} l_{\text{wavelet}}\left( {\bar{o}}_{i}, o_{i}\right)&=\alpha l_{HF}\left( {\bar{o}}_{i}, o_{i}\right) +\beta l_{LF}\left( {\bar{o}}_{i}, o_{i}\right) \nonumber \\&=\alpha \sum _{i}^{N} p\left( {\bar{o}}_{i}, o_{i}\right) \log \left( \frac{p\left( o_{i}\right) }{q\left( {\bar{o}}_{i}\right) }\right) +\beta \left\| {\bar{o}}_{i}-o_{i}\right\| _{F}^{2} \end{aligned}$$
(3)

In the above formulation, \(\alpha \) and \(\beta \) are hyperparameters: the weights of the high- and low-frequency sub-band coefficient losses, respectively.

Selecting MSE as a loss function in deep learning is usually a simple and efficient choice, and we have noticed in our experiments that, for the image pixel-based loss, MSE usually gives better results under the PSNR evaluation criterion. Therefore, it is included in the total objective function, defined as:

$$\begin{aligned} l_{\text{total}}=l_{\text{wavelet}}+\eta l_{\text{image}}+\mu l_{\text{texture}}+\zeta l_{\text{edge}} \end{aligned}$$
(4)
$$\begin{aligned} l_{\text{total}}=&\,\alpha \sum _{i}^{N} p\left( {\bar{o}}_{i}, o_{i}\right) \log \left( \frac{p\left( o_{i}\right) }{q\left( {\bar{o}}_{i}\right) }\right) +\beta \left\| {\bar{o}}_{i}-o_{i}\right\| _{F}^{2}\nonumber \\&+\eta \left\| {\tilde{I}}_{i}-I_{i}\right\| _{F}^{2}+\mu \sum _{i}^{N} \max \left( \lambda \left\| o_{i}\right\| _{F}^{2}+\varepsilon -\left\| {\bar{o}}_{i}\right\| _{F}^{2},0\right) \nonumber \\&+\zeta \left\| C\left( {\tilde{I}}_{i}\right) -C\left( I_{i}\right) \right\| _{2}^{2}. \end{aligned}$$
(5)

\(\eta \) and \(\mu \) in the formula are the weights of the image-based and texture-based loss functions, respectively. The role of \(\lambda \) and \(\varepsilon \) in the texture-based loss is to ensure that its value is not zero.

MWSR has four sub-loss functions: the wavelet loss, image loss, texture loss, and edge loss. Since our intention is to obtain SR images with clearer edges, the weight of the edge loss is appropriately increased. In addition, to predict the high-frequency detail information missing from LR images, the high-frequency coefficients are given a higher weight than the low-frequency coefficients. To reduce the negative effects of MSE, a smaller weight is used for the image loss. In our experiments, these weights are all hyper-parameters; we empirically set \(\alpha \), \(\beta \), \(\eta \), \(\mu \), and \(\zeta \) to 0.99, 0.01, 0.1, 1, and 1.2, respectively.
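Putting Eq. (5) together with the reported weights gives the hedged PyTorch sketch below. The distributions p and q in the high-frequency term are not fully specified in the paper, so a softmax over the flattened coefficients is used as a stand-in, and lambda and epsilon take illustrative values:

```python
import torch
import torch.nn.functional as F

ALPHA, BETA, ETA, MU, ZETA = 0.99, 0.01, 0.1, 1.0, 1.2   # weights from the paper
LAM, EPS = 1.0, 1e-3                                      # illustrative values

def hf_loss(pred_hf, gt_hf):
    # KL-style term of Eq. (2); p, q approximated by softmax distributions.
    log_q = F.log_softmax(pred_hf.flatten(1), dim=1)
    p = F.softmax(gt_hf.flatten(1), dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")

def texture_loss(pred_hf, gt_hf):
    # max(lambda * ||o||_F^2 + eps - ||o_bar||_F^2, 0), averaged over the batch.
    return torch.clamp(LAM * gt_hf.pow(2).sum((1, 2, 3)) + EPS
                       - pred_hf.pow(2).sum((1, 2, 3)), min=0).mean()

def total_loss(pred_hf, gt_hf, pred_lf, gt_lf, sr_img, hr_img, l_edge):
    # l_edge is precomputed (the Canny operator is not differentiable).
    l_wavelet = ALPHA * hf_loss(pred_hf, gt_hf) + BETA * F.mse_loss(pred_lf, gt_lf)
    return (l_wavelet + ETA * F.mse_loss(sr_img, hr_img)
            + MU * texture_loss(pred_hf, gt_hf) + ZETA * l_edge)
```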

4 Experiment

The experiments in this paper are conducted on two public face datasets, CelebA [29] and LFW [42]. We selected 10,416 face images from CelebA as the training set and 9,230 face images as the validation set. We also selected 1,063 and 653 face images from CelebA and LFW, respectively, as test sets. All face images are cropped and aligned to a size of 128\(\times \)128.

We tested the SR performance of MWSR by 4x down-sampling the original face images and compared it against several classic super-resolution methods: BICUBIC interpolation, SRCNN [8], U-Net [31], SRGAN [13], RCAN [14], and Wavelet-SRNet [22]. We trained all methods with the same CelebA and LFW training sets. PSNR and SSIM are used to evaluate the SR performance of the above methods.
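The evaluation protocol can be sketched as follows; the `sr_model` interface is an assumption, and PSNR/SSIM come from scikit-image:

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(sr_model, hr_img):
    """hr_img: 128x128x3 uint8 ground truth; sr_model maps a 32x32 LR image
    back to 128x128 (hypothetical interface)."""
    lr = cv2.resize(hr_img, (32, 32), interpolation=cv2.INTER_CUBIC)  # 4x down
    sr = sr_model(lr)
    psnr = peak_signal_noise_ratio(hr_img, sr, data_range=255)
    ssim = structural_similarity(hr_img, sr, channel_axis=2, data_range=255)
    return psnr, ssim
```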

Table 1 summarizes the PSNR and SSIM results; the best results are shown in red and the second-best in blue. As can be seen from Table 1, MWSR exceeds the aforementioned algorithms under both the PSNR and SSIM criteria. We also performed ablation experiments on MWSR, separately excluding the mask generation network, the edge loss, and the linear low-rank convolution operation on the same dataset. The results of the ablation experiments are shown in Figs. 8 and 9.

Table 1 Quantitative results on CelebA and LFW test sets
Fig. 8
figure 8

The PSNR of ablation experiment

Fig. 9
figure 9

The SSIM of ablation experiment

Fig. 10
figure 10

Perceptual distance of results obtained by classic super-resolution methods

Fig. 11
figure 11

Comparison with classic super-resolution methods

Reference [43] points out that both PSNR and SSIM have limitations in evaluating the quality of real-world images; hence, we use an additional evaluation metric (perceptual similarity) in our experiments. Reference [44] provides a large and highly differentiated perceptual similarity dataset, which uses traditional methods (light adjustment, Gaussian blur, noise addition, deformation, color change, etc.) and deep learning methods (denoising, style transfer, encoding, and decoding) to process the ground truth and generate two distorted images for each ground-truth image. The judgments of human observers (484k judgments) on which distorted image is closer to the ground truth are used as the annotations of the dataset. Based on this dataset, the authors propose a new perceptual similarity measure that models the underlying perceptual similarity better than PSNR and SSIM. The results are shown in Fig. 10: the smaller the perceptual similarity value, the better the subjective quality of the SR image. Compared with CNN-based networks, the proposed MWSR shows better subjective quality.
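If the metric of [44] corresponds to the publicly released LPIPS package (an assumption consistent with the description above), the perceptual distance can be computed as:

```python
import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")          # AlexNet backbone, as in [44]
# Inputs must be (N, 3, H, W) tensors scaled to [-1, 1].
sr = torch.rand(1, 3, 128, 128) * 2 - 1    # stand-in for an SR result
hr = torch.rand(1, 3, 128, 128) * 2 - 1    # stand-in for the ground truth
d = loss_fn(sr, hr)                        # smaller = perceptually closer
print(float(d))
```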

Figure 11 shows the visual quality of the SR results obtained by MWSR and several state-of-the-art algorithms from 4x down-sampled low-resolution input face images.

As can be seen from Table 1 and Fig. 11, although RCAN achieves better PSNR and SSIM than Wavelet-SRNet, the tooth gap in its results is less pronounced than in those of Wavelet-SRNet. The images predicted by SRGAN are affected by the dataset and contain obvious noise. The SR face images inferred by MWSR recover the facial features best; in particular, the details of the teeth can be clearly seen. Compared with classic CNN-based super-resolution methods, MWSR achieves better PSNR results and recovers more facial features.

CNN-based methods generally use MSE as the loss function. Since MSE has an averaging effect, SR images derived by methods such as SRCNN and RCAN can be blurred. Wavelet-SRNet uses deep convolutional layers to accurately predict, from the LR image, the high-frequency wavelet coefficients describing image details; a high-quality SR image can then be reconstructed by the inverse wavelet transform. Inspired by this work, we also use the wavelet transform to predict high-frequency wavelet coefficients, and we use a pre-trained segmentation network to generate facial mask images, giving higher attention to the facial features in face super-resolution. At the same time, under the constraint of the edge loss function, MWSR obtains clearer image edges in the reconstruction stage, further improving the subjective quality of SR images.

5 Conclusion

CNN-based SR networks can perform very well in terms of PSNR and SSIM by simply stacking more residual connections to realize extremely deep networks. However, the reconstructed image is often over-smoothed, and for face SR applications the facial features are lost.

Wavelet-based neural networks achieve better subjective quality than direct CNN-based SR algorithms, although with inferior PSNR/SSIM results. By improving the wavelet-based neural network of [24], improvements in both subjective and objective metrics can be achieved, which shows that the wavelet-based approach has further potential.

In this paper, a wavelet-based face image SR algorithm is proposed that uses a facial mask to help train the attention-based neural network.

The neural network learns the relationship between the wavelet coefficients of the LR face image and the HR face image while paying more attention to the facial features. The wavelet structure inherently separates the low-frequency information from the details by storing them in different sub-bands. This helps MWSR predict the SR wavelet coefficients in the different sub-bands, which have the same size as the LR face, thus simplifying the mapping relationship to be learned.

The masking operation allows the network to focus on the facial features, further reducing the computational burden and enhancing the accuracy of the network. Therefore, compared with most existing methods, MWSR achieves competitive results in terms of PSNR and SSIM, as well as the best visual perceptual quality.