1 Introduction

Vision is an indispensable way for humans to obtain information and gain basic cognition and understanding of the world. However, manually processing large-scale image datasets is time-consuming and laborious. Machine vision methods (for example, automated monitoring, object/scene identification, and segmentation) can provide substantial efficiencies, but image quality is a key factor in their performance. In clinical medicine, for example, medical imaging has become an important auxiliary tool for physicians: high-quality medical images provide clear information on organ tissue and function and improve the efficiency and accuracy of diagnosis and treatment. However, images are inevitably corrupted by noise of varying strength and distribution during generation, storage, transmission, and application. Edges and characteristic details become obscured or blurred, degrading image quality to the point where it no longer meets the requirements of production and scientific research.

As the demand for high-quality images continues to increase, denoising has become a key branch of computer vision. The goal of denoising is to remove image noise while preserving critical information as far as possible, thereby restoring the latent clean image. Researchers have proposed a range of denoising methods, which are mainly divided into traditional [1,2,3] and deep learning methods.

Traditional methods include spatial filtering and frequency-domain methods. Frequency-domain methods transform the original signal into a representation that is easier to denoise, such as the curvelet [4] and wavelet [5] domains; the wavelet transform, however, struggles to preserve smooth edges. Spatial filtering operates on a pixel’s neighborhood [1,2,3], and includes mean filtering [8], median filtering [9], Wiener filtering [10], and non-local means filtering [11]. Mean filtering replaces each pixel with the average of its neighborhood; it is simple but, being a smoothing operation, blurs edges and fine details. Median filtering, a similar neighborhood operation that uses the median of the neighboring pixels, preserves image definition but fails to suppress certain kinds of noise. Non-local means filtering exploits the redundant information in the image to retain sharpness and detail, but at a higher computational cost. Anisotropic filtering [12, 13] overcomes the edge blurring associated with Gaussian filtering and maintains image edges. Generally speaking, however, traditional methods are simple, require manual parameter tuning, generalize poorly, and are prone to blurring or over-smoothing.

In light of the limitations of these low-level traditional techniques, recent work has focused on applying deep learning to image enhancement and noise reduction. Deep learning has demonstrated superior denoising performance compared with traditional techniques across a variety of noise distributions; methods include DnCNN [14], IRCNN [15], FFDNet [16], VST-net [17], and RED30 [18]. Supervised learning, however, requires clean images as labels to guide training, and collecting a sufficiently large number of clean labeled images can be very difficult in fields such as medicine and biology due to instrument or cost constraints [19]. As a result, many supervised methods can only be trained on synthetic images and may not generalize well to real images.

Unsupervised learning avoids this burden by instead exploiting the structural characteristics of the noisy data itself. Lehtinen et al. propose Noise2Noise [20], which does not require clean images and trains directly on pairs of images with identical content but independent noise; its denoising performance is close to that of supervised methods. However, obtaining such noisy image pairs in real scenes is often impractical [20], which greatly limits its application. To address these issues, Noise2Void [21], Noise2Self [22], Noise2Same [23], Noise2Sim [24], Self2Self [25], Deep Image Prior [26], 4D deep image prior [27], and convolutional blind-spot neural networks [28] have been proposed. Noise2Void requires neither noisy image pairs nor clean targets: it employs a blind-spot network that predicts each pixel from its neighboring pixel values and thereby restores clean pixels from a single noisy image. However, the blind spots prevent the network from being fully trained on all pixels, and the method makes prior assumptions about the image and the noise; if these conditions are not met, denoising performance degrades [28]. Inspired by non-local means filtering, Noise2Sim innovatively exploits the self-similarity of an image to train the denoising network [29]. Noise2Atom [30] is designed for scanning transmission electron microscopy images and takes advantage of two external networks to impose additional constraints from domain knowledge.

This paper builds upon the principle of Noise2Void by proposing a pseudo-siamese network that works directly on noisy images, without clean labels or noisy image pairs. The network fills the blind spots of the image with two strategies: one branch fills them with zeros, and the other fills them with randomly selected surrounding pixels. The different filling strategies produce different network inputs, and the two branches map them into a new space to produce their respective representations. The loss function designed in this paper consists of three parts: the MSE losses between each branch's prediction and the noisy image, and the MSE loss between the two branches. By minimizing this loss, the outputs of the two branches are drawn toward each other and thus toward the latent clean image, while the diverse filling strategy further helps avoid degenerate constant solutions. Although the parameters of the two branches are not shared, this paper improves the efficient channel attention module [31] to let the two branches communicate and adds a global maximum pooling branch to it.

2 Related work

2.1 Noise2Void

We use the formula \(x = s + n\) [32] to describe the generation of noisy images, where \(x\) is the noisy image, \(s\) is the noise-free signal, and \(n\) is the noise. From this generation process, the task of image denoising is to decompose the noisy image \(x\) into its two components: the clean signal \(s\) and the noise \(n\). We represent a noisy image by the following joint distribution.

$$ p(s,n) = p(s)p(n|s) $$
(1)

Noise2Void assumes that \(p(s)\) is an arbitrary distribution satisfying Formula 2; in other words, the pixels \(s_{i}\) of the signal are not statistically independent [21]:

$$ p(s_{i} |s_{j} ) \ne p(s_{i} ) $$
(2)

\(s_{i}\) and \(s_{j}\) denote two pixels within a certain radius of each other. As for the noise \(n\), Noise2Void assumes that it is conditionally pixel-wise independent and zero-mean.
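Under this notation, the two noise assumptions can be summarized as follows (a restatement of the conditions above, not a formula from the original Noise2Void paper):

$$ p(n|s) = \prod\limits_{i} {p(n_{i} |s_{i} )} ,\qquad E\left[ {n_{i} } \right] = 0 $$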

To avoid learning the identity mapping, the network uses a special receptive field that excludes the blind-spot pixel itself. In other words, the predicted value at position \(i\) is determined by all values in the rectangular neighborhood except the blind pixel \(x_{i}\). During training, the network objective can be expressed as follows:

$$ \mathop {\arg \min }\limits_{\theta } \sum\limits_{j} {\sum\limits_{i} {L\left( {f\left( {x_{RF(i)}^{j} ;\theta } \right),s_{i}^{j} } \right)} } ,\quad f\left( {x_{RF(i)}^{j} ;\theta } \right) = \hat{s}_{i}^{j} $$
(3)

Here, the denoising network is regarded as a function \(f\), where \(x_{RF(i)}^{j}\) is a patch around position \(i\), and the output is the predicted pixel \(\hat{s}_{i}^{j}\). \(\theta\) denotes the parameters of the denoising network, and \(s_{i}^{j}\) is the target pixel corresponding to the input. \(L\) is the standard MSE loss.
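In practice, since only noisy images are available, the noisy pixel values serve as the targets at the blind-spot positions. A minimal PyTorch-style sketch of this masked objective is given below; the network `net`, the manipulated input `x_filled`, and the 0/1 blind-spot mask are illustrative names, not the original implementation.

```python
import torch

def n2v_loss(net, x, x_filled, mask):
    """Masked MSE corresponding to Formula 3: predict the original noisy
    values x only at the blind-spot positions, using the manipulated input
    x_filled in which those positions were replaced.
    mask: float tensor with 1 at blind-spot positions, 0 elsewhere."""
    pred = net(x_filled)                        # \hat{s}: prediction from the blind-spot input
    sq_err = (pred - x) ** 2                    # squared error against the noisy target
    return (sq_err * mask).sum() / mask.sum()   # average over blind-spot pixels only
```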

In summary, Noise2Void opens the door to a plethora of applications where large-scale clean images are unavailable. Supervised algorithms can achieve good denoising results but require clean images to guide training, which greatly limits their practical application; intuitively, N2V cannot be expected to outperform supervised methods, yet experiments show that its performance drops only moderately. However, Noise2Void fails to make full use of the blind-spot values during training and is therefore inefficient [32], and there is still a gap between its denoising results and the image quality required in practice.

3 Method

3.1 Pseudo-siamese network

The overall framework of the pseudo-siamese network proposed in this paper is shown in Fig. 1. Blind pixels in the input image are selected at random; one branch fills them with zeros, and the other fills them with randomly chosen neighboring pixels. Each branch uses a residual network to predict the blind pixels, and a channel attention module connects the two branches and weighs the importance of the different channels. At the end of the network, a similarity term draws the results of the two branches closer to each other, with the aim of keeping both close to the latent clean image and preventing either branch from drifting in the wrong direction. Neither training nor testing of the pseudo-siamese network requires clean images; it therefore shows great potential in domains such as medicine and biology, where clean images are difficult to obtain.

Fig. 1

The framework of the pseudo-siamese network, including the convolution, attention, resblocks, and blind spot filling modules. The resblocks in the two branches do not share parameters but are composed of the same residual-module structure

Following the structure of the pseudo-siamese network in Fig. 1, the input is a noisy image, denoted \(N\). The outputs of the two branches are denoted \(D\) and \(D^{\prime}\), respectively. Writing each module as a function \(f\), the processing of the pseudo-siamese network is represented as

$$ D = f_{{{\text{conv}}1}} (f_{{{\text{att}}}} (f_{{{\text{res}}1}} (f_{{{\text{bsfn}}}} (N)))) $$
(4)

where \(f_{{{\text{conv}}1}} ,f_{{{\text{att}}}} ,f_{{{\text{res}}1}} ,f_{{{\text{bsfn}}}}\) denote the convolution, attention, resblocks, and blind spot filling (neighboring-pixel) functions of the first branch, respectively. Note that this is the branch used at test time.

Similarly, the branch whose blind spots are filled with zeros is represented as follows

$$ D^{\prime} = f_{{{\text{conv}}2}} (f_{{{\text{att}}}} (f_{{{\text{res}}2}} (f_{{{\text{bsfz}}}} (N)))) $$
(5)

where \(f_{{{\text{conv}}2}} ,f_{{{\text{att}}}} ,f_{{{\text{res}}2}} ,f_{{{\text{bsfz}}}}\) denote the convolution, attention, resblocks, and blind spot filling (zero) functions of the second branch, respectively. Formulas 4 and 5 show that the filling modules and resblocks are not shared, whereas the parameters of the attention mechanism are shared between the two branches; the attention module is what lets the branches communicate. We also experimented with sharing the resblock parameters, but the results were worse than keeping them separate. Overall, the pseudo-siamese network is both separate and connected.
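For concreteness, a minimal PyTorch-style sketch of Formulas 4 and 5 is given below. The submodules passed to the constructor (`fill_neighbor`, `fill_zero`, `res1`, `res2`, `att`), the channel width, and the final convolutions are illustrative assumptions rather than the exact implementation.

```python
import torch.nn as nn

class PseudoSiamese(nn.Module):
    """Sketch of Formulas 4 and 5: two branches with separate blind-spot
    filling and resblocks, a shared attention module, and separate
    output convolutions."""

    def __init__(self, fill_neighbor, fill_zero, res1, res2, att, channels=64):
        super().__init__()
        self.fill_neighbor = fill_neighbor   # f_bsfn: fill blind spots with random neighbors
        self.fill_zero = fill_zero           # f_bsfz: fill blind spots with zeros
        self.res1, self.res2 = res1, res2    # f_res1, f_res2: parameters not shared
        self.att = att                       # f_att: shared by both branches
        self.conv1 = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # f_conv1
        self.conv2 = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # f_conv2

    def forward(self, n):
        d = self.conv1(self.att(self.res1(self.fill_neighbor(n))))        # Formula 4 (test branch)
        d_prime = self.conv2(self.att(self.res2(self.fill_zero(n))))      # Formula 5 (auxiliary branch)
        return d, d_prime
```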

3.2 Blind spot filling module

A general network that uses exactly the same noisy image as both input and label easily learns the identity mapping and cannot extract useful image information, as shown in Fig. 2a. Therefore, following Noise2Void, each branch of the pseudo-siamese network uses blind spots to avoid the identity mapping. Specifically, the network replaces each blind spot with another value and then predicts it from the information in its rectangular neighborhood, as shown in Fig. 2b. Although the inputs and targets are both noisy images, the network assumes that the noise is conditionally independent, so neighboring information cannot predict the noise and only the hidden clean pixel values can be recovered. As training converges, each branch learns to remove the noise from each pixel and produce a clean image. The similarity constraint keeps the pixel values output by the two branches close, so both are driven toward the clean image.

Fig. 2

The general network and the blind spot network. The general network takes the entire noisy image as input, whereas the blind spot network replaces the blind-spot pixels in the input to avoid the identity mapping
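The blind-spot filling used by the two branches can be sketched as follows. The number of blind spots per image and the sampling of replacement pixels inside a 5 × 5 window are illustrative choices under the description above, not the paper's exact procedure.

```python
import torch

def fill_blind_spots(x, num_spots=64, window=5, mode="neighbor"):
    """Replace randomly chosen blind-spot pixels in an (H, W) image tensor x,
    either with a random pixel from the surrounding window ("neighbor")
    or with zero ("zero"). Returns the filled image and the blind-spot mask."""
    h, w = x.shape
    filled = x.clone()
    mask = torch.zeros_like(x)
    ys = torch.randint(0, h, (num_spots,))
    xs = torch.randint(0, w, (num_spots,))
    r = window // 2
    for y, xx in zip(ys.tolist(), xs.tolist()):
        if mode == "neighbor":
            # pick a random offset inside the window, excluding the blind spot itself
            while True:
                dy = torch.randint(-r, r + 1, (1,)).item()
                dx = torch.randint(-r, r + 1, (1,)).item()
                if (dy, dx) != (0, 0):
                    break
            ny = min(max(y + dy, 0), h - 1)
            nx = min(max(xx + dx, 0), w - 1)
            filled[y, xx] = x[ny, nx]
        else:
            filled[y, xx] = 0.0          # zero-filling branch
        mask[y, xx] = 1.0
    return filled, mask
```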

3.3 Attention module

Recently, numerous building modules have improved the achievable accuracy of deep learning and shown extraordinary potential, such as attention [33], dilated convolution [15], memory blocks [34], and wavelet transforms [35], though often at the expense of computational efficiency. Building on the SENet [36] module, Efficient Channel Attention (ECA) is lightweight and efficient and adaptively determines the size of its one-dimensional convolution kernel from the channel dimension. However, ECA uses only global average pooling [37]. This paper adds a global maximum pooling branch to extract more diverse feature textures, as illustrated in Fig. 3. The results of global average pooling and global maximum pooling are summed, a sigmoid is applied, and the result is finally multiplied element-wise with the input.

Fig. 3

The attention module with an added global maximum pooling branch to extract more diverse feature textures

The attention module connects the two branches, which do not share parameters, and assigns different weights to different channels so that they are trained in a targeted way. On the one hand, the attention mechanism automatically adjusts the weight of each channel during training, enhancing useful channels and suppressing irrelevant ones; on the other hand, the added global maximum pooling branch extracts more diverse channel statistics, which improves the denoising result.
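A sketch of this modified attention module is given below: ECA-style channel attention whose one-dimensional kernel size is derived from the channel dimension, extended with a global maximum pooling branch whose descriptor is summed with the average-pooling descriptor before the sigmoid. Sharing the 1D convolution between the two pooling branches is an assumption of this sketch rather than a detail stated above.

```python
import math
import torch.nn as nn

class ECAWithMaxPool(nn.Module):
    """ECA-style channel attention with an extra global max-pooling branch."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # adaptive 1D kernel size from the channel dimension, as in ECA-Net
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                       # x: (B, C, H, W)
        avg = self.avg_pool(x).squeeze(-1).transpose(-1, -2)    # (B, 1, C)
        mx = self.max_pool(x).squeeze(-1).transpose(-1, -2)     # (B, 1, C)
        y = self.conv(avg) + self.conv(mx)                      # sum the two pooled descriptors
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)     # (B, C, 1, 1) channel weights
        return x * y                                            # element-wise product with input
```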

3.4 Loss function

The loss function designed in the paper is defined as follows

$$ \arg \min_{\theta } \left( {5L\left( {f\left( {x_{{{\text{neighbor}}}} ;\theta } \right),y} \right) + 2L\left( {f\left( {x_{{{\text{zero}}}} ;\theta } \right),y} \right) + 3L\left( {f\left( {x_{{{\text{zero}}}} ;\theta } \right),f\left( {x_{{{\text{neighbor}}}} ;\theta } \right)} \right)} \right) $$
(6)

\(x_{{{\text{neighbor}}}}\) denotes the input whose blind-spot values are replaced by random neighboring pixels, and \(x_{{{\text{zero}}}}\) the input whose blind-spot values are replaced by zeros. \(y\) stands for the label image, which is the original noisy image. \(\theta\) denotes the parameters of the pseudo-siamese network.

The loss function consists of three parts: the losses between each branch's prediction and the label, and the similarity between the two branches. Experiments compared weighting the three parts equally versus unequally, and the PSNR with unequal weights was higher. We therefore weight the first branch, the second branch, and the similarity term by 5, 2, and 3, respectively. The weights mainly reflect that the first branch is the one used at test time, while the second branch is auxiliary. In addition, comparing the two branches for similarity lets each branch tend toward the clean label during training.

Define the pseudo-siamese network as a function \(f\), let \(y_{{\text{neighbor}}}\) and \(y_{{\text{zero}}}\) denote the outputs of the branches whose inputs are \(x_{{\text{neighbor}}}\) and \(x_{{\text{zero}}}\), respectively, and let \(\theta_{{\text{neighbor}}}\) and \(\theta_{{\text{zero}}}\) denote the parameters of the two branches; then

$$ f\left( {x_{{{\text{neighbor}}}} ;\theta_{{{\text{neighbor}}}} } \right) = y_{{{\text{neighbor}}}} ,\qquad f\left( {x_{{{\text{zero}}}} ;\theta_{{{\text{zero}}}} } \right) = y_{{{\text{zero}}}} $$
(7)

In the pseudo-siamese network, the mean square error (MSE) loss is utilized. The number of samples is \(m\). The prediction result and label are denoted as \(y_{{{\text{result}}}} ,y_{{{\text{label}}}}\), respectively. The loss \(L\) is defined as follows:

$$ L\left( {y_{{{\text{result}}}} ,y_{{{\text{label}}}} } \right) = \frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {y_{{{\text{result}}}}^{\left( i \right)} - y_{{{\text{label}}}}^{\left( i \right)} } \right)^{2} } $$
(8)
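Putting Formulas 6–8 together, the training loss can be written as the following sketch, where `y_neighbor` and `y_zero` are the two branch outputs and `y` is the original noisy label; computing the MSE over the full images (rather than only blind-spot positions) is an assumption of this sketch.

```python
import torch.nn.functional as F

def pseudo_siamese_loss(y_neighbor, y_zero, y):
    """Weighted sum of the three MSE terms in Formula 6 (weights 5, 2, 3)."""
    l_neighbor = F.mse_loss(y_neighbor, y)   # test branch vs. noisy label
    l_zero = F.mse_loss(y_zero, y)           # auxiliary branch vs. noisy label
    l_sim = F.mse_loss(y_zero, y_neighbor)   # similarity between the two branches
    return 5 * l_neighbor + 2 * l_zero + 3 * l_sim
```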

4 Experiment and analysis

This paper uses PSNR to analyze the experimental results objectively and displays the denoised images for intuitive visual inspection. For comparison, we evaluate the pseudo-siamese network against BM3D [38], non-local means filtering, median filtering, mean filtering, and Noise2Void across a variety of noise levels.

4.1 Magnetic resonance imaging data

With the continuous development of artificial intelligence and image processing techniques, magnetic resonance imaging (MRI) has been widely adopted in medicine. The superiority of MRI [39,40,41,42,43] is primarily owed to its lack of ionizing radiation, its non-invasiveness, and its high resolution. MRI can also directly acquire three-dimensional cross-sectional and multi-directional images without reconstruction. Compared with computed tomography (CT) [44], MRI offers significantly clearer detail of the central nervous system, joints, muscles, and other parts.

MRI equipment forms images by collecting k-space data and applying the Fourier transform [45]. The magnetic resonance coil records real-part and imaginary-part signals with a phase difference of 90 degrees. Both the real and imaginary signals contain additive white Gaussian noise with zero mean and equal variance, and the noise in the two channels is independent.

Given this noise process, MRI magnitude images have been shown to obey a Rician distribution [46,47,48,49]. The signal-to-noise ratio (SNR) [50], the ratio of the power of the signal to that of the noise, determines how the Rician noise behaves: at low SNR it approaches a Rayleigh distribution [51], while at high SNR it approaches a Gaussian distribution. This complexity and variability of Rician noise makes denoising a considerable challenge. To the best of our knowledge, few unsupervised learning methods have been applied to MRI denoising, so this research is of practical significance and theoretical value.

All experiments in this paper use the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, a public real brain dataset. We obtained a total of 199 three-dimensional MRI volumes, each of which was sliced along the axial plane. Slices numbered 37–86 were selected, and Rician noise with noise levels 10, 20, and 30 was added. In all, there are 9750 two-dimensional images, divided into 7750 training images, 1000 test images, and 1000 validation images. All images are 145 × 121 grayscale. During the experiments, Adam [52] is used for optimization, and the window size is set to 5 × 5.
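The exact procedure for synthesizing the Rician noise is not restated here, so the sketch below uses the conventional construction consistent with the MRI noise model above: corrupt the real and imaginary channels with independent zero-mean Gaussian noise of standard deviation equal to the noise level and take the magnitude.

```python
import numpy as np

def add_rician_noise(image, sigma, rng=None):
    """Synthesize Rician noise: the magnitude of a complex signal whose real
    and imaginary parts are each corrupted by i.i.d. zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    real = image + rng.normal(0.0, sigma, image.shape)  # noisy real channel
    imag = rng.normal(0.0, sigma, image.shape)          # noisy imaginary channel (signal assumed real)
    return np.sqrt(real ** 2 + imag ** 2)
```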

4.2 Results analysis and discussion

4.2.1 Quantitative metrics

The experiments are carried out on Ubuntu 18.04, using the PyTorch 1.1.0 deep learning framework in Python. An NVIDIA GeForce GTX 1060 is used for acceleration.

In image denoising, the commonly used quantitative evaluation metric is the peak signal-to-noise ratio (PSNR), which measures the degree of distortion between a denoised image \(p\) and a clean image \(q\). PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation [53, 54]; the larger the value, the smaller the distortion, and its unit is dB. For a given \(M \times N\) image, PSNR is defined as follows:

$$ {\text{PSNR}}\left( {p,q} \right) = 10\log_{10} \frac{{255^{2} \times M \times N}}{{\sum\nolimits_{{\left( {x,y} \right) \in \Omega }} {\left| {p\left( {x,y} \right) - q\left( {x,y} \right)} \right|^{2} } }} $$
(9)
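Formula 9 translates directly into the following NumPy sketch, assuming \(p\) and \(q\) are grayscale arrays of the same shape in the 0–255 range.

```python
import numpy as np

def psnr(p, q, peak=255.0):
    """Peak signal-to-noise ratio (Formula 9) between a denoised image p
    and a clean reference q, in dB."""
    mse = np.mean((p.astype(np.float64) - q.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```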

Experimental results in terms of PSNR are shown in Table 1. Compared with the traditional transform-domain and filtering methods, the pseudo-siamese network proposed in this paper greatly improves denoising performance at all noise levels. Compared with BM3D, our method improves by more than 6 dB at every noise level, reaching 7.75 dB at noise level 30. Among the traditional methods, NLM achieves the best denoising results, with the highest PSNR at noise levels 10 and 20, indicating that it comes closest to the clean image; even so, the gap to the pseudo-siamese method remains large, reaching 7.85 dB at noise level 30. These results clearly demonstrate the effectiveness of the pseudo-siamese network.

Table 1 PSNR of different methods. The best PSNR at different noise levels is highlighted in bold

It is worth noting that the traditional methods denoise poorly, whereas both deep learning methods exceed 25 dB. Compared with the reference algorithm Noise2Void, our method improves by 0.50 dB at noise level 20, while the two are essentially equivalent at noise level 30, where the difference is only 0.10 dB.

To illustrate the effectiveness of each module, ablation experiments are carried out under the above settings. As shown in Table 2, at noise level 10 the pseudo-siamese network predicts clean images more effectively than Noise2Void, improving PSNR by 0.14 dB. Adding the global maximum pooling branch to the attention module improves PSNR by a further 0.30 dB, indicating that the improved attention extracts features more fully.

Table 2 PSNR of ablation experiments. The best PSNR at noise level 10 is highlighted in bold

4.2.2 Qualitative metrics

PSNR, as an objective evaluation criterion, has certain limitations: in some cases an image with a higher PSNR still looks worse. In view of this, this paper also visually compares the denoised images, as shown in Fig. 4.

Fig. 4

Visual inspection of images denoised by the different methods. a Images with different noise levels. b Results of BM3D. c Results of mean filtering. d Results of median filtering. e Results of NLM. f Results of our pseudo-siamese network

As can be seen from Fig. 4, the traditional methods denoise MRI poorly at high noise levels. Relative to the clean labels, BM3D recovers image details better than mean filtering and median filtering, but it causes large-scale blurring in the left half of the MRI; moreover, like NLM at noise level 30, it cannot effectively remove the background noise, and the denoising effect of NLM at that level is hard to discern. Mean filtering and median filtering have a certain denoising effect, but they greatly blur the key details of the image and reduce its quality; at the more extreme noise level of 30, the brain edges can no longer be clearly observed. On the whole, the traditional methods tend to blur the images and lose most of the tissue details.

In contrast, the pseudo-siamese network proposed in this paper effectively eliminates noise and better restores the original information. The edges are relatively sharper, and the background is processed more cleanly than by the traditional methods. In summary, compared with the other methods evaluated, the proposed pseudo-siamese network achieves the best results in both quantitative and qualitative terms.

5 Conclusion

It is costly to obtain large numbers of clean images or noisy image pairs in real imaging scenarios, which limits the application of supervised methods in this domain. Inspired by the blind-spot network Noise2Void, this paper designs a new pseudo-siamese network with channel attention that is trained on noisy images only. The network employs two different strategies to fill the blind spots in its two branches, using neighboring pixel values and zeros, respectively. On the one hand, neighboring pixel features are used to predict the corresponding blind spots; on the other hand, the results of the two branches are constrained to be similar, driving both toward the clean image. Experiments verify the effectiveness of the pseudo-siamese network: compared with traditional methods at three noise levels, our method improves substantially in both quantitative and qualitative terms. In summary, the method reduces the distortion of the denoised output using noisy images alone, which opens a door to applications in medicine, biology, and other fields.

However, the added modules increase the complexity and computational cost of the network to a certain extent. In future research, we will give more consideration to the trade-off between efficiency and effectiveness, and will further study how to improve denoising when the noise distribution does not satisfy the assumptions of Noise2Void.