Learning-based low-rank denoising

The denoising of 2D images through low-rank methods is a relevant topic in digital image processing. This paper proposes a novel method that trains a learning network to predict the optimal thresholds of the singular value decomposition involved in the low-rank denoising of 2D images. To improve the denoising results, we apply the block-matching algorithm and classify each 3D block according to four parameters, which increase the specificity of the network for the prediction of the thresholds. Our method outperforms state-of-the-art methods for image denoising; furthermore, it is general with respect to the type of noise and provides an upper bound to the accuracy of the denoising of 2D images through the Singular Value Decomposition.


Introduction
Signals acquired by digital sensors are generally affected by different types of noise, such as speckle [1] and exponential [2] noise on biomedical images, salt-and-pepper [3], Gaussian [4], and Poisson [5] noise on images acquired through camera sensors (Sect. 2).
Our goal (Sect. 3) is the definition of a novel denoising method belonging to the low-rank class, based on the singular value decomposition (SVD) and the learning and prediction of the optimal thresholds. Given a training data set of groundtruth images (Fig. 1), we apply different types of noise (e.g. Gaussian, speckle, salt-and-pepper, Poisson, exponential noise) and extract 3D blocks through the block-matching algorithm [6]. Then, we compute the optimal singular values through a proper optimisation applied to the SVD of each 3D block. We iterate this approach, where the input image of each iteration is the denoised image at the previous step.
The input and optimal singular values compose the training data set of the learning model; the singular values of the training data set are classified according to four parameters: noise type and intensity, number of iterations of the method, and clustering of the singular values. This classification allows us to design specific networks and improve the accuracy of the learning model. The learning phase optimises a matrix of weights, applied to train the model and B Simone Cammarasana simone.cammarasana@ge.imati.cnr.it 1 CNR, IMATI, Via de Marini, 6, 16149 Genoa, Italy predict the optimal thresholds of the SVD. For the image denoising, we compute the 3D blocks and the SVD, the four parameters of each 3D block, and the thresholds through the network's prediction. Finally, the block-matching aggregation is applied to reconstruct the denoised image (Sect. 4).
As main contributions, we train our learning-based model to predict the optimal thresholds of the SVD. In contrast, learning-based methods [7] predict the denoised image; lowrank methods [8] manually select the SVD thresholds; the block-matching 3D (BM3D) [6] applies a wavelet decomposition, with a manual selection of the thresholds. Our denoising method is independent of the type of noise and provides an upper bound to the accuracy of the denoising of 2D images through the SVD. Our approach can be generalised to different transformations of the noisy signal, e.g. spectral (wavelet, shearlet), factorisation (e.g. SVD, HOSVD), or to volumetric data through spherical harmonics and Laplacian eigenpairs (Sect. 5).

Related work
Low-rank methods compute the denoised image as the lowrank matrix approximation of the noisy image [8]. This class of methods is extensively applied in biomedical 2D [9] and 3D [10] images, hyperspectral images [11], and 2D videos [12]. The numerical properties of the singular values thresholds are analysed [13] for low-rank matrix completion. Total variation (TV) and low-rank approximation are applied  [14,15], by iteratively updating the weight parameters of TV regularisation. The learning of an adaptive regulariser for low-rank approximation [16] assumes a Laplacian distribution of the singular values. A Wiener filter [17] is estimated by optimising a low-rank approximation, where eigenvalue thresholds are computed by using a prior non-local selfsimilarity. A tensor-based approach [18] applies low-rank approximation.
The weighted nuclear norm minimisation (WNNM) [19] computes the stacks as BM3D method, performs an SVD on the stacks and applies a weighted threshold to the singular values, where lower weights correspond to higher singular values. Lower singular values capture the noisy component of the image, so their reduction is larger. Finally, the collaborative filtering reconstructs the denoised image.
Spectral decomposition transforms a signal into its spectral domain and exploits the sparsity of the transformed signal to remove noise, through a threshold operation [20]. The BM3D [6] computes and stacks similar patches through the non-local means (NLM) method [21], which uses the patterns' redundancy of the input image, and each patch is restored with a weighted average of all the other patches, where each weight is proportional to the similarity among the patches. In BM3D, each stack is transformed into its spectral domain with wavelet decomposition, denoised through a hard/soft threshold, and reconstructed in the space domain. Then, the denoised patches are aggregated by a collaborative filter to reconstruct the denoised image.
An over-complete dictionary represents the signal as a linear combination of atoms [22], which are iteratively updated through the SVD of the representation error to better fit the data. To estimate the coefficients of the sparse representation of the input image, the non-locally centralised sparse representation (NCSR) [23] exploits the non-local redundancies, combined with local sparsity properties; the dictionary is learned by clustering the patches of the image into k clusters through the k-means [24] method and then learning a sub-dictionary for each cluster.   N) is composed of the groundtruth X and the noisy component N, where f defines the combination of one or more types of noise (e.g. additive, impulsive, multiplicative). Low-rank approximations recover the underlying low-rank matrix from its degraded observation. In this class, the nuclear norm minimisation recovers an estimationX of the ground-truth signal through the minimisation of the energy functional Y−X 2 F +λ X * , where the nuclear norm X * is the sum of the singular values of X, and measures the compliance of the approximated image to the high-frequency components of the input image. This problem is equivalent to recover the approximated imageX = U(τ (S))V , where Y = USV , S := diag(s), is the SVD of the noisy signal, and τ (S) = diag(s − λ) with λ constant vector. The thresholding τ (S) improves the image reconstruction by removing the noise component while preserving image features. Weighted thresholds imply a different shrinkage to each singular value, as S hh −λ h , where S hh is the (h, h) entry of S. The proposed learning-based method trains a network to predict the optimal thresholds λ h . We present the computation of the optimal thresholds (Sect. 3.1), the related denoising accuracy (Sect. 3.2), the setting (Sect. 3.3), and optimisation (Sect. 3.4) of the learning model for the prediction of the thresholds.

Optimal thresholds
Given a n × n ground truth image X and a noisy instance Y = f (X, N), we compute the optimal thresholds λ h through the minimisation of the distance between the ground-truth and the reconstructed signalX, as where the reconstructed signal is computed through the SVD of the Y signal with weighted thresholds, i.e.
The first-order derivatives of the energy functional E(λ) with respect to the variables of the optimisation problem λ are ∂ λ k E = 2 n i, j=1 (1) is computed through L-BFGS (Limited-memory Broyden, Fletcher, Goldfarb, Shanno) [25], which finds the roots of the derivative of the energy functional. L-BFGS is an iterative optimisation algorithm in the family of quasi-Newton methods that estimates the inverse Hessian matrix for the minimum search in the variable space, by representing the approximation as a few vectors. At each iteration, a small history of the past updates of the variables (λ) and of the gradient of the energy functional E(·) in Eq. (1) is applied to identify the direction of steepest descent and to implicitly perform operations requiring vector products with the inverse Hessian matrix. The memory storage is O(u 2 ) and the computational cost is O(uv) at each iteration, where u is the total number of variables, and v is the number of steps stored in memory.

Optimal denoising thresholds
Through the optimal singular values thresholding, we set an upper limit to the accuracy of denoising.
Application of block matching Given a pixel y, the patch P y is the set of pixels in the neighbourhood of y. The search window is the set of selected patches used to search the closest ones to a reference patch, according to certain metrics.
The stack, or 3D block, is the set of patches similar to a reference patch. The block-matching algorithm stacks the similar patches into a 3D structure and uses the redundancy of the stack to remove the noise. Then, the patches are aggregated by a collaborative filter to compose the denoised image.
Optimal denoising: results We compute the optimal denoising of three images with a different noise. The method is applied for 20 iterations: at each iteration, the denoised image of the previous iteration is used as a noisy image, while the ground truth is fixed. We compute the 3D blocks with the block-matching algorithm and extract their groundtruth counterparts from the ground-truth image. Then, we compute the optimal thresholds λ h for the SVD through the minimisation of Eq. (1) for each 3D block, and we reconstruct the denoised image.
Considering the Gaussian, speckle, and salt-and-pepper noise (Fig. 2), the denoised images after 20 iterations are very close to the input image, thus showing that the nuclear norm minimisation recovers the ground-truth image with any type of noise, when the proper thresholds are selected. Most of the details are well reconstructed, such as the cameraman's face, the boat profile, or the mandrill zygoma. Images with a different noise, in terms of both type and intensity, have a different reconstruction accuracy. This result motivates us to distinguish learning-based networks according to the type of noise (Sect. 3.3). The optimal PSNR plot (Fig. 3, 1st row, blue line) provides the upper bound of the optimal denoising and allows us to compare the accuracy of our learning-based method with the optimal achievable results.
Optimal denoising: thresholds In Fig. 3(2nd row), the averaged optimal thresholds at the first iteration:λ it=1 where B is the total number of 3D blocks. Different types of noise have similar behaviour; for example, in (a) the first singular value is reduced to an average value of 0.14, which is less than the 1% of the average input value. Then, the optimal thresholds increase up to 1.2 for the fourth singular value, which is more than the 73% of the input singular value. Finally, the thresholds decrease: The reduction of the 49th singular value is about 0.2.

Training and clustering for learning
Applying an arbitrary noise to a set of input images, we build a data set of ground truth and noisy images to train a learningbased method to predict the optimal thresholds. In particular, the input of the network is the vector of the singular values of the noisy block and the target is the vector of the optimal thresholds (Eq. (1)).
Instead of defining a single data set for all the images, we organise the images according to four parameters: (i) type of noise, (ii) iteration, (iii) noise intensity, and (iv) singular values clustering. This classification improves the specificity of Fig. 4 Comparison between previous work and our denoising: a cameraman (256 × 256), speckle noise with σ = 0.05; b boat (256 × 256), salt-and-pepper noise with density 0.05; c mandrill, Gaussian noise with μ = 0, σ = 0.01; d magnification the network and the accuracy of the prediction of the thresholds. The four parameters are nested: For each image noise (i) and iteration step (ii), we compute the 3D block's noise intensity (iii) with the nonlinear noise estimator (NOLSE) [26] metric and define c intervals so that the NOLSE values are uniformly distributed into c classes. We cluster the singular values (iv) into k groups with the k-means algorithm; then, we train a support vector machine (SVM) [27] to classify the vectors of singular values according to the k-means labels. For each combination of the four parameters, we define the matrix P = {P k } N k=1 , with P k = [p 1 , . . . , p n ] , where N is the number of 3D blocks and n is the lower dimension of the 3D block; the p elements are the singular values computed through the SVD. These values are associated with the corresponding optimal thresholds Q = {Q k } N k=1 , computed as the minimisation of Eq. (1).

Learning model
For each combination of the four parameters (Sect. 3.3), we compute a dedicated learning-based network. Given the singular values of a 3D block, we train a network to predict the thresholds that reconstruct the best approximation of the ground-truth block. The network weights are defined through the matrix W of dimension n × n, and are computed by minimising the distance between the target and the predicted thresholds, as F(W) =

Training results
We select a data set of 39 images [28], of different resolutions and subjects. We apply an artificial noise to the images (e.g. Gaussian noise with mean 0 and variance 0.01); then, we compute the 3D blocks with block matching. We compute the input singular values and the target thresholds (Sect. 3.1). Images are classified according to the noise type (e.g. speckle, Gaussian) and intensity (e.g. σ = 0.05 in speckle noise) of the input image. Then, we split the data into c = 5 3D block noise intensity classes, and k = 3 k-means labels (Sect. 3.3), for a total of 15 instances for each iteration. The final data set is composed of a set of 15 matrices n× N , where N is the total number of 3D blocks, and n is the number of singular values (i.e. 49 in our tests, corresponding to a 7 × 7 patch). This approach is repeated at each iteration: for the computation of the data set at iteration t, the block matching is applied for (t − 1) iterations. At each iteration, the proper networks previously computed by the minimisation model (Sect. 3.4) are used to predict the thresholds and denoise the related 3D block. At iteration t, the optimal thresholds are computed through an optimisation step (Eq. (1)). In our tests, we apply 5 iterations of the training, for a total number of 75 networks (i.e. 5 × 3 × 5) for each type of noise and noise intensity.
Computation time The computation of the input/optimal thresholds (Eq. (1)) and of the network weights (Sect. 3.4) are performed offline. For the computation of the training data set, we used 39 images, for a total number of 257,240 couples of input/target vectors for each noise type and iteration. The execution time for the generation of the data set of each noise type/intensity is approximately 120 minutes for each iteration. For the computation of the weights, the variables of each network are 49 × 49 matrices; we consider 3 types of noise (i.e. salt and pepper, speckle, Gaussian), 5 iterations for each noise, 5 3D block's noise intensity classes, and three k − means labels, for a total number of 225 networks. The execution time of the training of each network is about 15 minutes. Tests have been executed with MATLAB R2020a, on a workstation with 2 Intel i9-9900KF CPUs (3.60 GHz), and 32GB RAM. Additional tests with increased noise intensity (i.e. Gaussian and speckle) and different types of noise (i.e. Poisson and exponential) are discussed in the additional material.

Denoising results
We discuss the generality of our method from the type of noise applied. Applying a speckle noise with σ = 0.05 (Fig. 4a), our method improves previous work and shows better preservation of the main features and details (e.g. the face of the man, the folds of the man's trousers), and good noise removal (e.g. the sky in the background). Furthermore, our method does not introduce blurred patterns. In comparison, WNNM removes a larger amount of noise, while losing several details (e.g. man's mouth), and increasing the blurring effect. On the boat image with salt-and-pepper noise with 0.05 density (Fig. 4b), our method better preserves main details, such as the clouds, or the geometries of the boat. In comparison, BM3D does not remove the "salt" noise on the bottom side of the boat. On the mandrill image with μ = 0 and σ 2 = 0.01 Gaussian noise (Fig. 4c), all methods have similar results, correctly remove the noise and preserve the main features of the image (e.g. the nose and the eyes of the mandrill). Only the mandrill's pupil in the NCSR method is less uniform than our method. Table 1 summarises the quantitative metrics of our method and previous work. With Gaussian noise, our method has comparable results with state-of-the-art methods, which are specific for this type of noise. In (c), the PSNR value is 26.01 for our method, 26.13 for WNNM (best result), and lower than 26 for BM3D and NCSR. Instead, our method has better results when other types of noise (e.g. salt and pepper, speckle) are applied. In Fig. 3, we compare the PSNR of our method with the optimal PSNR results (Sect. 3.1); in (a), our method has a PSNR value of 28.69 after 4 iterations, while the optimal PSNR value is 30.53. The error of the network prediction of the thresholds, even if small, leads our method to diverge from the optimal reconstruction. After five/six iterations, our method is not able to further improve the image approximation. Considering a large data set [29] of natural images of different resolutions and subjects, and applying different types of artificial noise, our method has slightly better results than previous work in the case of Gaussian noise, while it widely improves the PSNR values in the case of speckle and salt-and-pepper noise (Fig. 5).
Computation time The execution time for denoising each 3D block is 0.015 seconds; this time is composed of the noise intensity computation through the NOLSE metric (66%), the SVD (5%), the SVM prediction of the label of the singular values (14%), and the network prediction of the optimal thresholds (15%). Due to the operations performed for each 3D block, our method (330 s) increases the execution time for the denoising of a 256 × 256 image, with respect to previous work: BM3D (1 sec.), WNNM (100 s), and NCSR (180 s).

Selected thresholds
The training data set of the Gaussian noise with μ = 0 and σ 2 = 0.01 is composed of more than 257K elements, each one composed of a couple of vectors (i.e. input and optimal target) in R 49 . Figure 6a shows the partition of these elements, according to the 3D block noise intensity and the k-means classification. Since each 3D block noise intensity class contains the same number of elements, we have a regular partition of the training data set. In contrast, the k-means classification does not provide a uniform partition of the elements; indeed, some networks are trained with more data than others. However, each network has a sufficient number of elements due to the huge data set; in fact, the minimum number of elements is N = 4204, related to 3D block noise class c = 1 and k-means label k = 3. Figure 6b shows a comparison among the input, optimal, and predicted singular values of a specific 3D block. The optimal singular values reflect the thresholds in Fig. 3(2nd col.); for example, the shrinkage of the second singular value is larger than the first one. The predicted singular values are very close to the optimal singular values; in fact, the error for the prediction of a 49−dimensional vector is 0.6, which means an average error of 0.01 for each element of the vector. Analysing the error between the prediction and the optimal singular values (Q jk −Q jk ) 49 j=1 (Sect. 3.4), the first values show a higher prediction error than the last ones (Fig. 6c), due to the different magnitude and variability of these values. However, there is not a clear correlation between these factors. The correlation between the prediction error of the thresholds and the data clustering (Sect. 3.3) shows that the error increases with the 3D block noise intensity (Fig. 6d), as the noise intensity implies a more irregular distribution of the singular values, which is harder to train and predict. Instead, the SVM-predicted classes and the prediction error of the network do not show any correlation.

Conclusions and future work
We have proposed a denoising method that learns the optimal thresholds of the SVD from a training data set and applies a prediction of these optimal values to improve the image reconstruction. As main contributions, the proposed method improves previous work on denoising, and it is general with respect to the noise type and intensity. We have also analysed the upper bounds of the nuclear norm minimisation for the denoising of 2D images, and the learning results for the prediction of the optimal thresholds of the SVD. As future work, we plan to improve the approximation accuracy of the ground-truth image, by reducing the error of the thresholds prediction, extending the proposed method to different data (e.g. 3D images, 2D videos), and signal decompositions (e.g. wavelets). Trained networks are available at https://github. com/cammarasana123/denoise long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.