Deep unfolding multi-scale regularizer network for image denoising

Existing deep unfolding methods unroll an optimization algorithm with a fixed number of steps, and utilize convolutional neural networks (CNNs) to learn data-driven priors. However, their performance is limited for two main reasons. Firstly, priors learned in deep feature space need to be converted to the image space at each iteration step, which limits the depth of CNNs and prevents CNNs from exploiting contextual information. Secondly, existing methods only learn deep priors at the single full-resolution scale, and thus ignore the benefits of multi-scale context in dealing with heavy noise. To address these issues, we explicitly consider the image denoising process in the deep feature space and propose the deep unfolding multi-scale regularizer network (DUMRN) for image denoising. The core of DUMRN is the feature-based denoising module (FDM) that directly removes noise in the deep feature space. In each FDM, we construct a multi-scale regularizer block to learn deep prior information from multi-resolution features. We build the DUMRN by stacking a sequence of FDMs and train it in an end-to-end manner. Experimental results on synthetic and real-world benchmarks demonstrate that DUMRN performs favorably compared to state-of-the-art methods.


Introduction
Image denoising is a fundamental problem in low-level vision since corruption by noise is inevitable during the image acquisition process. Image denoising aims to recover the latent clean image x from the corresponding noisy observation y. Mathematically, the degradation model for the denoising problem can be formulated as

y = x + n    (1)

where n is generally assumed to be additive white Gaussian noise (AWGN) with standard deviation σ.
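The degradation model of Eq. (1) can be simulated directly; the following is a minimal sketch (using a flat list as a stand-in for an image, and a hypothetical `add_awgn` helper) showing that the added noise is zero-mean with the requested standard deviation:

```python
import random

def add_awgn(x, sigma, seed=0):
    """Corrupt a clean signal x with additive white Gaussian noise of
    standard deviation sigma, following the model y = x + n."""
    rng = random.Random(seed)
    return [px + rng.gauss(0.0, sigma) for px in x]

clean = [128.0] * 1000                 # a flat toy "image"
noisy = add_awgn(clean, sigma=30)
residual = [a - b for a, b in zip(noisy, clean)]
mean = sum(residual) / len(residual)   # close to 0 (zero-mean noise)
var = sum(r * r for r in residual) / len(residual) - mean ** 2
std = var ** 0.5                       # close to the target sigma of 30
```

This is how the synthetic Gaussian-noise benchmarks used later in the paper are typically generated from clean images.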
Due to the ill-posed nature of the image denoising problem, many conventional methods use various image prior terms based on the statistics of natural images, including sparse models [1][2][3], non-local self-similarity models [4][5][6][7][8], low-rank models [9,10], and Markov random field models [11,12]. Despite their significant progress, these model-based methods usually reconstruct latent clean images by solving complicated optimization problems, which limits their practical application. Other methods [9,10,13] sacrifice flexibility and efficiency to achieve high performance.
With the development of deep convolutional neural networks (CNNs), many learning-based methods [14][15][16][17][18] have been proposed for image denoising. With the powerful representation ability of deep CNNs, these methods outperform traditional model-based methods by a large margin. However, most learning-based methods directly learn the mapping between noisy and clean image pairs without considering a physical model of the noise process, which makes them more efficient but less interpretable than model-based methods.
Taking advantage of both model-based and learning-based methods, many deep unfolding methods incorporate standard optimization methods into deep CNNs. They unfold the image denoising problem through various optimization algorithms (e.g., gradient descent [19,20], the alternating direction method of multipliers [21], and primal-dual [22]), and implement the regularization term using deep CNNs, which can implicitly learn deep priors in the feature space.
By integrating the image degradation constraint into CNNs, deep unfolding methods maintain the efficiency and improve the interpretability of deep learning. However, the performance of current deep unfolding methods is still limited for two main reasons. Firstly, in each iteration step of deep unfolding methods, noise is removed in the standard image space, but deep priors are learned in feature space. The transformation from deep feature space to image space limits the depth and receptive field of CNNs, which prevents CNNs from extracting non-local dependencies within images. Secondly, existing deep unfolding methods only learn deep priors at the original full-resolution, and thus cannot effectively capture spatial contextual information and restore clear edges for images that suffer from heavy noise [23].
In this paper, we propose the deep unfolding multi-scale regularizer network (DUMRN) to more closely integrate model-based and learning-based methods. To reduce the number of space transformations and improve information flow within the network, we explicitly consider the image denoising process in the deep feature space, and propose a feature-based denoising module (FDM) based on the image degradation model. By mimicking the gradient descent optimization process, a sequence of FDMs is stacked to build the DUMRN, so that we can obtain a deep CNN with a large receptive field and train it in an end-to-end manner. In each FDM, we construct a multi-scale regularizer block (MSRB) to learn deep prior information from multi-resolution features, which is able to capture local details at high resolution and large-scale contextual information at low resolution.
To summarize, the main contributions of this paper are three-fold:
- We explicitly formulate the image denoising process in deep feature space and propose the feature-based denoising module (FDM), which removes noise directly in feature space and avoids repeated transformations between feature and image space.
- We construct a multi-scale regularizer block (MSRB) that learns deep prior information from multi-resolution features, capturing local details at high resolution and large-scale contextual information at low resolution.
- We build the DUMRN by stacking a sequence of FDMs and training it in an end-to-end manner; experiments on synthetic and real-world benchmarks show that it performs favorably against state-of-the-art methods.

Related work
Image denoising methods can be roughly categorized as traditional model-based methods, deep learning-based methods, and deep unfolding methods. In this section, we briefly review these methods.

Traditional methods
Since image denoising is an ill-posed problem, various regularization or prior terms have been proposed to constrain the solution space. For example, Elad and Aharon [1] enforced sparsity on image patches by constructing highly over-complete dictionaries. Dabov et al. [6] proposed the well-known block-matching and 3D filtering (BM3D) method to combine non-local self-similarity with sparsity for image denoising. Chen et al. [13] learned a Gaussian mixture model prior from external image patches and utilized it to find similar patches in input noisy images for denoising. However, hand-crafted priors are not strong enough to characterize complex image structures and usually involve non-convex and time-consuming optimization problems.

Deep learning methods
Motivated by the great success of CNNs in high-level vision tasks, various learning-based methods have been proposed for image denoising. Zhang et al. [14] proposed DnCNN, which incorporates residual learning [24] and batch normalization [25] in a CNN to learn residuals between the noisy input and the corresponding clean image. To increase the flexibility of the network to deal with noise of different levels, Zhang et al. [15] utilized a noise level map as input and performed denoising in the down-sampled sub-image space. Inspired by DenseNet [26], Zhang et al. [17] utilized dense connections to construct a residual dense network (RDN) for image restoration and achieved state-of-the-art results. For real-world image denoising, Guo et al. [27] proposed the convolutional blind denoising network (CBDNet) with a noise estimation subnetwork, while Kim et al. [28] constructed the adaptive instance normalization network (AINDNet) to transfer the Gaussian denoiser to various real noisy scenes. To decrease the computational cost on low-noise images, Yu et al. [29] proposed a multi-path CNN with a dynamic path selection module to adaptively select appropriate routes for different image regions. Recently, several methods [16,30,31,32] have employed multi-scale strategies to enlarge the receptive field and improve the performance of deep networks. For example, Chang et al. [16] incorporated multi-size dilated convolutions into a U-Net [33] structure to capture multi-scale contextual information, which helps to restore rich details in complex scenes. In order to increase the modeling capacity of deep priors, we utilize deep CNNs to learn deep prior information at different scales, which helps to enlarge the receptive field and capture long-range contextual information.

Deep unfolding methods
Deep unfolding methods integrate the advantages of model-based methods (e.g., good interpretability) and learning-based methods (e.g., efficiency and strong representation capability). They usually unfold iterative optimization algorithms as a cascade scheme with a fixed number of steps. Deep CNNs are utilized as regularizers in each step, which implicitly learn deep image priors. Many unfolding methods have been proposed for various image restoration tasks, including image denoising [34][35][36], image deblurring [19,20,37], and image super-resolution [38]. Schmidt and Roth [37] unrolled the half-quadratic optimization procedure into an end-to-end learning framework and proposed a random field-based architecture to learn an image restoration regularizer. Chen and Pock [19] proposed a trainable nonlinear reaction diffusion (TNRD) model by unfolding the gradient descent procedure to a fixed number of iterations. To integrate sparsity regularization with deep CNNs, Simon and Elad [36] unfolded the convolutional sparse coding model by the learned iterative shrinkage-thresholding algorithm. By incorporating deep CNNs into a fully parameterized gradient descent scheme, Gong et al. [20] proposed to learn a universal gradient descent optimizer and construct a recurrent gradient descent network (RGDN) for image restoration. Recently, Ren et al. [39] incorporated the adaptive consistency prior into the maximum a posteriori framework, and proposed DeamNet based on unfolding the optimization problem. Driven by large training sets, deep unfolding methods optimize their parameters in an end-to-end manner and surpass model-based methods.
Existing deep unfolding methods utilize CNNs to learn data-driven priors in deep feature space, but they remove noise in image space. Thus, deep features must be transformed into image space at each iteration step, which limits the depth of CNNs and makes it difficult for them to exploit contextual information within images. In order to make full use of deep CNNs, we explicitly consider the image denoising problem in deep feature space and construct a deep network for image denoising.

Proposed method
In this section, we first propose the feature-based denoising module (FDM) which performs image denoising in deep feature space. Then we introduce the multi-scale regularizer block (MSRB) which learns deep prior information from features at different resolutions. Finally, we stack several FDMs using an unfolding strategy to build the DUMRN for image denoising.

Feature-based denoising module
From a Bayesian perspective, the maximum a posteriori (MAP) model for denoising can be formulated as

x̂ = arg min_x (1/2)‖y − x‖² + λφ(x)    (2)

where the first fidelity term guarantees that the solution x is in accordance with the degradation process in Eq. (1), φ(x) is the regularization term associated with prior information, and λ is the regularization parameter that controls the trade-off between these terms. Eq. (2) can be solved by various optimization algorithms, such as gradient descent [19,20], the alternating direction method of multipliers [21], and the primal-dual method [22]. Here we utilize the momentum gradient descent method due to its simplicity and effectiveness [40,41]. Thus, x̂ can be obtained through the iterations in Eqs. (3) and (4):

z_t = x_t − α_t[(x_t − y) + λ∇φ_t(x_t)]    (3)
x_{t+1} = z_t + β_t(x_t − x_{t−1})    (4)

where t denotes the step index, α_t is the step size, ∇φ_t(·) denotes the gradient of φ(·) at step t, β_t is the momentum parameter at step t, and (x_t − x_{t−1}) is the momentum term. In order to learn deep image priors, deep unfolding methods employ CNNs to calculate the gradient of the regularizer ∇φ(·), which implicitly provides a deep image prior. The iterative procedure in Eqs. (3) and (4) is then unrolled with a fixed number of steps T. However, in each iteration step, noise in x_t is removed in the standard image space while the prior ∇φ(·) is learned in deep feature space. The mapping from deep feature space to image space limits the depth and receptive field size of CNNs, making it difficult for unfolding methods to capture non-local dependencies inside images.
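The momentum gradient descent iteration above can be checked on a toy scalar problem. The sketch below assumes the simple quadratic regularizer φ(x) = 0.5·x² (so ∇φ(x) = x), which is only an illustrative stand-in for the learned prior gradient; the MAP optimum then has the closed form y/(1 + λ):

```python
def momentum_gd_denoise(y, lam=0.5, alpha=0.3, beta=0.5, steps=60):
    """Momentum gradient descent for a toy MAP objective
    E(x) = 0.5*(x - y)**2 + lam * 0.5 * x**2, whose regularizer
    gradient is simply x.  Each loop pass performs one gradient step
    followed by one momentum step."""
    x_prev = x = y                      # initialize with the observation
    for _ in range(steps):
        grad = (x - y) + lam * x        # fidelity gradient + lam * grad_phi
        z = x - alpha * grad            # gradient step
        x, x_prev = z + beta * (x - x_prev), x   # momentum step
    return x

x_hat = momentum_gd_denoise(y=1.5)      # closed-form optimum: y / (1 + lam) = 1.0
```

Deep unfolding replaces the hand-picked α, β, and ∇φ here with learned modules, and truncates the loop at a fixed T.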
In order to make full use of deep CNNs and improve the information flow in the network, we explicitly consider removing noise in deep feature space. Specifically, we first utilize a feature extractor f(·) to map the image to the feature space, giving initial features Y = f(y). Then we can apply the iteration of Eqs. (3) and (4) to denoise the features directly, without mapping them back into image space. In order to scale the gradient and momentum terms adaptively, we replace the step size α_t and momentum parameter β_t by learned functions A_t(·) and B_t(·) respectively. We let S(·) replace ∇φ(·) to supplant the gradient of the regularizer; S(·) absorbs the trade-off weight λ and implicitly performs the role of the deep prior. Overall, the t-th feature-based denoising module (FDM) is formulated as

Z_t = X_t − A_t((X_t − Y) + S(X_t))    (5)
X_{t+1} = Z_t + B_t(X_t − X_{t−1})    (6)

where X_0 = Y. After T iterations, we utilize an image reconstructor g(·) to reconstruct the final denoised image from X_T:

x̂ = g(X_T)    (7)

Considering the effectiveness of the residual block (ResBlock) in low-level vision tasks [42,43], we use a single ResBlock to implement each of A_t(·) and B_t(·). The structure of the ResBlock is shown in Fig. 1(a): each ResBlock contains two 3 × 3 convolutional layers and a ReLU activation function [44]. For flexibility, we implement the feature extractor f and image reconstructor g using two convolutional layers with 3 × 3 learnable kernels.
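The FDM update described above can be sketched numerically. In this toy reading, `A`, `B`, and `S` are plain callables standing in for the learned ResBlocks (adaptive step size, adaptive momentum) and the regularizer; with linear stand-ins the iteration contracts to a fixed point, mirroring how repeated FDM steps progressively denoise the features:

```python
def fdm_step(X, X_prev, Y, A, B, S):
    """One FDM update on flat feature lists:
    X_next = X - A((X - Y) + S(X)) + B(X - X_prev)."""
    grad = [(x - y) + s for x, y, s in zip(X, Y, S(X))]
    momentum = [x - xp for x, xp in zip(X, X_prev)]
    return [x - a + b for x, a, b in zip(X, A(grad), B(momentum))]

# Illustrative stand-ins: fixed scalings and a linear "prior gradient".
A = lambda v: [0.3 * e for e in v]
B = lambda v: [0.5 * e for e in v]
S = lambda X: [0.5 * e for e in X]

Y = [1.5, 3.0]                          # extracted features f(y)
X_prev, X = Y, Y
for _ in range(60):
    X, X_prev = fdm_step(X, X_prev, Y, A, B, S), X
# X approaches [1.0, 2.0], the fixed point of this linear update
```

In the actual network, A, B, and S are trained jointly with f and g, so the fixed point is shaped by data rather than by a hand-chosen prior.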

Multi-scale regularizer block
Many unfolding methods [19,20,38] implement S(·) using deep CNNs due to their strong learning capability. However, limited by the size of the convolution kernel, CNNs fail to capture diverse contextual information, and most existing deep unfolding methods (e.g., TNRD [19] and CSCNet [36]) perform poorly when the noise level is high. To overcome this problem, we adopt a multi-scale strategy by down-sampling features to different scales. On one hand, down-sampling can effectively enlarge the receptive field and enable models to exploit more spatial contextual information, which is helpful in denoising images that suffer from heavy noise [23]. On the other hand, as observed in Ref. [45], noise decreases while strong edges are less affected by down-sampling. To gain the advantages of the multi-scale strategy, we implement S(·) using a multi-scale regularizer block (MSRB) to extract useful feature information at multiple resolutions:

S(X) = f_MSRB(X) = F(G_1(X), G_2(X↓2), G_3(X↓4))    (8)

where f_MSRB(·) represents the proposed MSRB function, ↓k represents the down-sampling operator with a scaling factor of k, {G_i(·)} (i = 1, 2, 3) denotes a set of deep CNNs that learn useful prior information from the features of different scales, and F(·) denotes the multi-scale feature fusion block utilized to aggregate features from multiple scales. In practice, we use three CNNs with the same structure but different channel numbers: the numbers of channels are set to 64, 128, and 256 at resolutions 1, 1/2, and 1/4 respectively. As shown in Fig. 1(b), each G_i(·) consists of two convolutional layers with ReLU activation functions and a ResBlock, and all kernel sizes are set to 3.
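The three-branch structure of Eq. (8) can be sketched in one dimension. Here the learned CNNs G_1..G_3 are replaced by identity maps and the learned fusion F by a naive upsample-and-average (both hypothetical stand-ins, not the paper's modules), just to show how the same input is regularized at resolutions 1, 1/2, and 1/4 and merged back to full resolution:

```python
def downsample(x, k):
    """Average-pool a 1-D feature by factor k (a stand-in for the
    stride-k convolution used in the paper)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x) - k + 1, k)]

def upsample(x, k):
    """Nearest-neighbour upsample by factor k (a stand-in for the
    stride-k deconvolution)."""
    return [v for v in x for _ in range(k)]

def msrb(X, G1, G2, G3, fuse):
    """Multi-scale regularizer block: S(X) = F(G1(X), G2(X↓2), G3(X↓4))."""
    return fuse(G1(X), G2(downsample(X, 2)), G3(downsample(X, 4)))

ident = lambda v: list(v)

def fuse_avg(f1, f2, f4):
    """Deliberately simple fusion: upsample and average the branches."""
    return [(a + b + c) / 3
            for a, b, c in zip(f1, upsample(f2, 2), upsample(f4, 4))]

X = [float(i) for i in range(8)]
S_X = msrb(X, ident, ident, ident, fuse_avg)   # full-resolution output
```

Even with identity branches, the coarse paths contribute spatially pooled context, which is the intuition behind the enlarged receptive field.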
Summation and concatenation are the most commonly-used strategies for feature fusion, but directly applying them to multi-scale feature fusion is not effective [46]. Motivated by the success of the back-projection technique in image super-resolution [47][48][49], we implement F(·) using a back-projection feature fusion (BPFF) block [46] to aggregate multi-scale features. As shown in Fig. 1(c), we first down-sample G_1(X) to the same resolution as G_3(X↓4), and compute their difference:

e_1 = G_3(X↓4) − (G_1(X))↓4    (9)

Then we enhance the prior information G_1(X) using the back-projected difference:

s = G_1(X) + (e_1)↑4    (10)

where ↑k denotes the up-sampling operator with a scaling factor of k. We obtain the final S(X) by applying a similar update procedure to integrate s and G_2(X↓2):

e_2 = G_2(X↓2) − s↓2    (11)
S(X) = s + (e_2)↑2    (12)

In MSRB, we use convolutional layers and deconvolutional layers to implement the down-sampling and up-sampling operators respectively; the strides of both the convolutional and deconvolutional layers are set to 2. By utilizing the BPFF block in MSRB, we can effectively aggregate prior information at different resolutions into a stronger prior.
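A 1-D numeric sketch of the back-projection fusion above, under the assumption that differences are taken as coarse-minus-fine and the resamplers are simple average-pooling and nearest-neighbour maps (the paper uses learned strided convolutions and deconvolutions instead):

```python
def downsample(x, k):
    """Average-pool by factor k (toy stand-in for a stride-k convolution)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x) - k + 1, k)]

def upsample(x, k):
    """Nearest-neighbour upsample by factor k (toy stand-in for a deconvolution)."""
    return [v for v in x for _ in range(k)]

def bpff(f1, f2, f4):
    """Back-projection feature fusion over three scales:
    residual corrections from the coarse branches are projected back
    onto the full-resolution branch, rather than plainly summed.
    f1, f2, f4 are features at full, 1/2, and 1/4 resolution."""
    e1 = [a - b for a, b in zip(f4, downsample(f1, 4))]   # coarse residual
    s = [a + b for a, b in zip(f1, upsample(e1, 4))]      # back-project to full res
    e2 = [a - b for a, b in zip(f2, downsample(s, 2))]    # mid-scale residual
    return [a + b for a, b in zip(s, upsample(e2, 2))]

# When the coarse branches agree with the down-sampled fine branch, both
# residuals vanish and the fusion returns the fine branch unchanged.
f1 = [float(i) for i in range(8)]
fused = bpff(f1, downsample(f1, 2), downsample(f1, 4))
```

This agreement property illustrates why back-projection is gentler than summation: coarse branches only inject the information in which they disagree with the fine branch.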

Overall architecture
The overall architecture of our proposed deep unfolding multi-scale regularizer network (DUMRN) is shown in Fig. 2. Let y and x denote the input noisy image and the corresponding ground-truth image respectively. The feature extractor f first extracts features from y. The extracted features Y are set as the initial value X_0 for the feature denoising process. T FDMs are then stacked to remove noise in feature space, updating X_t over T steps via the FDM update. Finally, we utilize the image reconstructor g to reconstruct the final image from the denoised feature X_T.
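The overall pipeline — extract, unroll T feature-space updates, reconstruct — can be summarized in a few lines. All callables here are toy stand-ins for the learned layers (the `halve` step is purely illustrative, not the FDM):

```python
def dumrn(y, f, g, fdm_steps):
    """DUMRN pipeline sketch: feature extractor f, T stacked FDM-style
    updates, image reconstructor g."""
    Y = f(y)
    X_prev = X = Y                      # X_0 initialized with f(y)
    for step in fdm_steps:              # one entry per unrolled module
        X, X_prev = step(X, X_prev, Y), X
    return g(X)

ident = lambda v: list(v)
halve = lambda X, X_prev, Y: [0.5 * x for x in X]   # trivial "denoiser" step
out = dumrn([8.0], ident, ident, [halve] * 3)       # 8 -> 4 -> 2 -> 1
```

Because the unrolled steps are ordinary function compositions, the whole stack is differentiable end-to-end, which is what allows joint training of f, g, and every FDM.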
The DUMRN is optimized by minimizing the difference between the denoised outputs and the corresponding ground-truth counterparts. To assess the effectiveness of the proposed network, we adopt the same L_2 loss function as previous works [14-16,18]. Given a training dataset {(y_i, x_i)}_{i=1}^N, where N is the number of training patch pairs, we obtain the optimal parameters by minimizing the objective function in Eq. (13):

L(Θ) = (1/2N) Σ_{i=1}^N ‖f_DUMRN(y_i; Θ) − x_i‖²    (13)

where f_DUMRN(·) represents the DUMRN function, and Θ represents all learnable parameters in DUMRN.

Fig. 2 Architecture of DUMRN. The first convolutional layer extracts feature X_0 from the noisy input, and then T feature-based denoising modules (FDMs) are stacked to remove noise in deep feature space; the structure of FDM is based on the momentum gradient descent algorithm. Benefiting from FDM, there is no need to transform deep features into image space at each step. The last convolutional layer converts the denoised feature X_T to a clean image. A multi-scale regularizer block S(·) learns deep prior information from features at different resolutions; ↓k represents down-sampling with a scaling factor of k.
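The L_2 training objective can be sketched as follows, with `denoiser` standing in for f_DUMRN(·; Θ) and patches flattened to lists (a minimal sketch, not the training code):

```python
def l2_loss(denoiser, pairs):
    """Mean squared L2 distance between denoised outputs and
    ground-truth patches over N (noisy, clean) training pairs."""
    total = 0.0
    for y, x in pairs:
        out = denoiser(y)
        total += sum((o - gt) ** 2 for o, gt in zip(out, x))
    return total / (2 * len(pairs))

ident = lambda v: list(v)              # a do-nothing "denoiser"
pairs = [([1.0, 2.0], [1.0, 1.0])]     # one (noisy, clean) pair
loss = l2_loss(ident, pairs)           # (0 + 1) / 2 = 0.5
```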

Experimental results
In this section, we first provide the training and implementation details of the proposed network. Then, we compare our proposed DUMRN with several state-of-the-art methods on synthetic and real-world image denoising tasks. Our source code is available at https://github.com/Xujz19/DUMRN.

Datasets
Following RDN [17] and SADNet [16], we adopt 800 high-resolution training images from the DIV2K dataset [50] to train our models for Gaussian denoising at four different noise levels (σ = 10, 30, 50, 70). In addition, we use noisy images with varying noise levels σ ∈ [0, 75] to train a single model for blind Gaussian denoising; it is referred to as DUMRN-B. All synthetic Gaussian noisy images were obtained by adding Gaussian noise of different levels to clean images. For real-world image denoising, following DeamNet [39], we trained our model on the SIDD medium dataset [51] and RENOIR dataset [52]. To evaluate DUMRN for gray-scale image denoising, we used three benchmark datasets: Set12 [14], BSD68 [53], and Urban100 [54]. For color image denoising, we chose Kodak24 [55], CBSD68, and Urban100 as test datasets. For real image denoising, we evaluated our model on the SIDD Benchmark dataset [51] and the Darmstadt Noise Dataset (DnD) [56]. We used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [57] as the metrics for quantitative evaluation.

Implementation details
We implemented our proposed model in the PyTorch environment and adopted the ADAM optimizer [62] with default parameters to optimize the network parameters. We trained our network on an NVIDIA TITAN Xp GPU for a total of 15,000 epochs. In each training iteration, we set the batch size to 16 and randomly cropped the noisy and clean image pairs into patches of size 96 × 96. Like other denoising methods [14][15][16][17][18][59], we performed data augmentation on the training images, using random flipping and rotation. We set the initial learning rate to 10^−4 and halved it every 2500 epochs. All trainable parameters were initialized using the Xavier method [63]. As a trade-off between efficiency and performance, we set the number of FDMs T to 6. More details of this choice are provided in Section 5.1.
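The learning-rate schedule described above amounts to a simple step decay, which can be written as:

```python
def learning_rate(epoch, base_lr=1e-4, half_every=2500):
    """Step schedule used in training: start at 1e-4 and halve the
    rate every 2500 epochs over the 15,000-epoch run."""
    return base_lr * 0.5 ** (epoch // half_every)

lr_start = learning_rate(0)        # 1e-4
lr_final = learning_rate(14999)    # 1e-4 * 0.5**5 = 3.125e-6
```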
Gaussian image denoising

Tables 1 and 2 summarize quantitative results for different methods on gray-scale and color image denoising, respectively. Our proposed DUMRN achieves the best PSNR under all experimental settings, demonstrating the superiority of our model. Specifically, DUMRN outperforms the most representative traditional method BM3D by 0.74-1.84 dB across different settings. For gray-scale image denoising on the Urban100 dataset with noise level σ = 50, DUMRN achieves 2.11 dB/0.0678 and 0.31 dB/0.0061 improvements over the previous deep unfolding methods TNRD and DeamNet, respectively. Benefiting from the incorporation of the physical model into deep CNNs, our model also outperforms the state-of-the-art deep learning-based methods DnCNN, SADNet, COLANet, and RDN. Taking color image denoising with noise level σ = 50 as an example, our DUMRN obtains 0.65 dB/0.0207, 0.45 dB/0.0161, and 1.37 dB/0.0320 improvements over DnCNN on Kodak24, CBSD68, and Urban100 respectively. Because the proposed FDM and multi-scale strategy enlarge the receptive field to exploit more spatial contextual information, DUMRN is especially effective when the noise level is high (σ = 50, 70). Specifically, on the challenging Urban100 dataset, DUMRN outperforms RDN by 0.42 dB/0.0175 on gray-scale image denoising and 0.37 dB/0.0112 on color image denoising when the noise level is σ = 70.

Figure 3 shows a visual comparison of results from different methods on gray-scale image denoising. We can observe that TNRD and CSCNet generate results with severe distortion and noticeable artifacts, and none of the competing methods recovers a clear edge of the chin in the photo, while DUMRN produces more faithful results. Visual comparisons for color images are shown in Figs. 4-6. It can be seen that our DUMRN produces sharper edges and recovers more details while other methods suffer from over-smoothing, demonstrating the powerful representation ability of our DUMRN model.
Quantitative results for our blind denoising model are also given in Tables 1 and 2. The performance of DUMRN-B slightly decreases due to the unknown noise level, but it is still higher than most deep learning-based methods which are trained for a known specific noise level, indicating that our model is robust in blind image denoising. Taking a noise level of σ = 50 as an example, DUMRN-B outperforms the non-blind RDN by 0.14 dB/0.0026, 0.14 dB/0.0020, and 0.20 dB/0.0056 on the Kodak24, CBSD68, and Urban100 datasets, respectively. As shown in Figs. 4-6, DUMRN-B also generates more faithful results than the other competing methods, further demonstrating that our model can effectively tackle the challenging blind Gaussian denoising task.
Real-world image denoising

Quantitative results for different methods on the SIDD benchmark and DnD dataset are provided in Table 3. All PSNR and SSIM results were obtained from the public benchmark websites. It can be seen that our DUMRN achieves performance comparable to these competing methods. Visual comparisons of results of different methods on an image from the DnD dataset are shown in Fig. 7.

Model analysis
In this section, we conduct ablation studies to analyze the effects of different components, including the FDM and the MSRB, and to examine the choice of unrolling length T. In addition, we compare the computational requirements of our DUMRN network with those of competing methods.

Analysis on FDM
The structure of the proposed DUMRN is similar to TNRD [19] and RGDN [20]; all are based on the gradient descent method. The main difference is that our DUMRN learns a data-driven prior from deep features, whereas TNRD and RGDN learn priors from the original images. The performance improvement of DUMRN over TNRD and RGDN shown in Table 4 illustrates the superiority of the proposed feature-based denoising framework. To further demonstrate the effect of the deep feature space, we set the feature extractor f(·) and image reconstructor g(·) to identity functions (DUMRN-I for short) to learn deep prior information from the raw image space. Table 4 shows that the PSNR of our whole model is 0.19 dB higher than that of DUMRN-I, which illustrates the importance of the feature-based denoising module. A visual comparison of results is given in Fig. 8, and the intermediate features are visualized in Fig. 9. It can be seen that noise is removed progressively from X_2 to X_6. Feature X_6 contains the least noise and has the sharpest texture: the proposed FDM effectively removes noise and reconstructs details.
The number of FDMs determines the depth of the DUMRN. In Fig. 10, we show the variation in PSNR performance and inference time with increasing numbers of steps T . PSNR values and inference time both increase as T rises. As T increases from 5 to 6, an obvious PSNR improvement is still obtained. However, when T becomes bigger than 6, the curve in Fig. 10 becomes flatter, and PSNR gains become minor. Thus, we conclude that DUMRN has almost converged when T = 6. Although DUMRN achieves the best PSNR performance when T = 8, it takes the most inference time. Considering the trade-off between the performance and inference time, we adopt T = 6 in our experiments.

Analysis on MSRB
To investigate the effect of the proposed multi-scale strategy and the adopted BPFF block, we designed several baseline models. Specifically, we trained the following alternatives to our model: (i) removing G_2(·) and G_3(·) in MSRB (DUMRN-1 for short), (ii) removing G_3(·) in MSRB (DUMRN-2 for short), (iii) changing the number of filters in G_2(·) and G_3(·) to 64 and letting them learn deep prior information at the original full resolution (DUMRN-3 for short), (iv) replacing the BPFF block with concatenation (DUMRN-C for short), and (v) replacing the BPFF block with summation (DUMRN-S for short). Quantitative results are shown in Table 5. Compared to DUMRN-1, DUMRN-2 has one more branch that learns deep prior information at a coarse resolution, and DUMRN-3 has two more branches that learn deep prior information at the original full resolution. It can be seen that DUMRN-2 achieves better denoising performance than DUMRN-3, indicating that using features at different scales provides more effective prior information. In addition, DUMRN outperforms DUMRN-1, DUMRN-2, and DUMRN-3, which further demonstrates that multi-scale information facilitates image denoising. Visual comparisons are given in Fig. 11; we can observe that learning a deep prior only at the original full resolution (DUMRN-1) is insufficient to recover fine texture details. Taking advantage of the multi-scale prior information, the denoised result of DUMRN contains fewer artifacts and much more detail than the results of DUMRN-1 and DUMRN-2.
From Table 5, it can also be observed that DUMRN generates better results than DUMRN-C and DUMRN-S: simply using concatenation or summation cannot effectively integrate multi-scale information.

Speed
We further evaluated the inference time of different methods for processing a 512 × 512 gray-scale image. All inference times were measured on an NVIDIA TITAN Xp GPU. Figure 12 shows that our DUMRN achieves better PSNR performance with lower inference time than the state-of-the-art methods RDN [17], COLANet [18], and NLRN [58]. Although DUMRN is a little slower than DnCNN [14], DudeNet [59], and SADNet [16], it achieves much better denoising performance. Overall, DUMRN performs well in terms of both effectiveness and efficiency.

Limitations
Like other denoising methods, DUMRN may fail to reconstruct proper details in some challenging cases. As Fig. 13 shows, strong noise may make vertical textures imperceptible in the noisy input; as a result, neither DUMRN nor the state-of-the-art RDN can correctly recover the vertical textures shown in the close-up. Severe corruption makes it difficult to restore the textures, and DUMRN generates the most likely texture patterns learned from the training dataset.

Conclusions
In this paper, we proposed the deep unfolding multi-scale regularizer network (DUMRN) to better integrate the traditional image denoising model with deep neural networks. We explicitly considered the image denoising process in deep feature space, and proposed a feature-based denoising module (FDM) following the iterative optimization steps of the image degradation model. Benefiting from the FDM, we can construct a deep network with a large receptive field to effectively learn deep prior information. In addition, we proposed the multi-scale regularizer block (MSRB) to extract more spatial contextual information from features at different scales, which is beneficial for images that suffer from heavy noise. We also analyzed the effect of each component of the proposed DUMRN. Extensive experiments demonstrate that our proposed DUMRN can effectively and robustly denoise images, as assessed both by quantitative metrics and visual quality. In the future, we will investigate real noise models and modify DUMRN to achieve better results for real image denoising. We will also extend our model to other image restoration tasks, such as image deblurring and rain removal.