Real-time denoising of ultrasound images based on deep learning

Ultrasound images are widespread in medical diagnosis for muscle-skeletal, cardiac, and obstetrical diseases, due to the efficiency and non-invasiveness of the acquisition methodology. However, ultrasound acquisition introduces noise in the signal, which corrupts the resulting image and affects further processing steps, e.g. segmentation and quantitative analysis. We define a novel deep learning framework for the real-time denoising of ultrasound images. Firstly, we compare state-of-the-art methods for denoising (e.g. spectral, low-rank methods) and select WNNM (Weighted Nuclear Norm Minimisation) as the best denoising in terms of accuracy, preservation of anatomical features, and edge enhancement. Then, we propose a tuned version of WNNM (tuned-WNNM) that improves the quality of the denoised images and extends its applicability to ultrasound images. Through a deep learning framework, the tuned-WNNM qualitatively and quantitatively replicates WNNM results in real-time. Finally, our approach is general in terms of its building blocks and parameters of the deep learning and high-performance computing framework; in fact, we can select different denoising algorithms and deep learning architectures.


Introduction
Ultrasound imaging uses high-frequency sound waves to visualise soft tissues, such as internal organs, and support medical diagnosis for muscle-skeletal, cardiac, and obstetrical diseases, due to the efficiency and non-invasiveness of the acquisition methodology. Ultrasonic sound waves are reflected off from different layers of body tissues. The main issues of the ultrasound techniques are a significant loss of information during the reconstruction of the signal, the dependency of the signal from the direction of acquisition, an underlying noise that corrupts the image and significantly affects the evaluation of the morphology of the anatomical district. In this context, the denoising of ultrasound images is relevant both for post-processing and visual evaluation by the physician.
Our main goal is the definition of a novel deep learning framework for the real-time denoising of ultrasound images (Fig. 1). After the design of a training data set, composed of raw images and the corresponding denoised images, we train a neural network that replicates the denoising results. Then, the real-time denoising is achieved through the prediction of the trained network. The proposed framework combines three elements: low-rank denoising, deep learning, and highperformance computing (HPC, for short).
We select WNNM-Weighted Nuclear Norm Minimisation [27] as the best denoising method, which is then specialised to ultrasound images as a "new" tuned-WNNM denoising, by tuning its parameters. The choice of WNNM is based on a qualitative and quantitative comparison of five denoising methods, i.e. WNNM, SAR-BM3D -SAR Block-Matching 3D [48], BM-CNN -Block Matching Convolutional Neural Network [5], NCSR -Non-Locally Centralised Sparse Representation method [19], PCA-BM3D Principal Component Analysis Block Matching 3D [16]) belonging to the spectral, low-rank, and deep learning classes (Section 2).
To achieve a real-time denoising of ultrasound images, we propose a deep learning framework that is based on the learning of the tuned-WNNM and HPC tools (Section 3). The training is performed offline and can be further improved with new data, a priori information on the input images or the anatomical district, and denoised images selected after experts' validation. Through our framework, the execution time of the denoising only depends on the network prediction, which is achieved in real-time on standard medical hardware.
As the main contribution, the proposed denoising of ultrasound images runs in real-time and is general in terms of the input data, in terms of resolution of the input images (e.g. isotropic, anisotropic images), acquisition methodology, anatomical district, noise (e.g. speckle or Gaussian noise), and the dimension of the images (i.e. 2D, 3D images). Our approach is also general in terms of the building blocks and parameters of the deep learning framework; in fact, we can select different denoising algorithms (e.g. WNNM, SARBM3D) and deep learning architectures (e.g. Pix2Pix, VGG19).
As experimental validation (Section 4), we perform a quantitative (e.g. PSNR, SSIM) and qualitative evaluation of the selected denoising methods on ultrasound images acquired from different anatomical (e.g. muscle-skeletal, obstetric, and abdominal) districts. Then, the results of the deep learning and HPC frameworks are quantitatively and qualitatively analysed on a large collection of ultrasound images. The industrial requirement of real-time denoising is verified in terms of the execution time of the network prediction on GPU-based hardware installed in commercial ultrasound machines. Finally, we discuss main results (Section 5), conclusions, and future work (Section 6).

Related work
We review previous work on image denoising (Section 2.1) and deep learning methods for denoising and regression (Section 2.2).

Image denoising
Non-local methods The Non-Local Means (NLM) denoising [11] uses the patterns' redundancy of the input image, and each patch is restored with a weighted average of all the other patches, where each weight is proportional to the similarity among the patches. The Bayesian non-local mean filter [30] improves the NLM with the introduction of a Bayesian estimator as distance measure among the patches, which allows the user to better determine the amount of denoising by the noise variance of the patch. The anisotropic neighbourhood in NLM [44] uses image gradient to estimate the Fig. 1 Pipeline of the proposed framework, with a magnification of prediction (red) and target (blue) images (right side). Our framework applies deep learning and HPC to learn and replicate the denoising results of state-of-the-art low-rank method (i.e. tuned-WNNM), in real-time and with a specialisation to anatomic districts edge orientation and then adapts the patches to match the local edges. The characterisation of the patches through a redundancy index [45] improves the self-similarity computation among patches. The improvement on the structure of the search window is achieved through the computation of an optimal search window for each pixel [63], according to the denoising degree of the related patch.

Anisotropic methods
The denoised image is computed as the solution to an anisotropic diffusion equation [49,50], where the gradient of the image guides the diffusion process. The variant [72] exploits the Lee [35] and Forst [24] filters, which are edge-sensitive to speckle noise. An improvement of the previous results [6] is achieved by applying the Kuan filter [34] in the diffusion equation and revising the selection of the neighbourhood used for the estimation of the statistical parameters. The anisotropic method introduces a class of fractional-order anisotropic diffusion equations [7], using the Fourier transform to compute the fractional derivatives, and the discrete Fourier transform to compute the fractionalorder differences.
Spectral denoising Denoising based on spectral decomposition transforms a signal into its spectral domain and exploits the sparsity of the transformed signal to remove noise, through a threshold operation. Several transformations have been applied to image denoising, such as Wavelets [13,39,46,51], Curvelets [61], Contourlets [14], and Shearlets [71]. To reconstruct the denoised image, the 3D blockmatching [15] computes and stacks similar patches through NLM; each stack is transformed into its spectral domain with wavelet decomposition, denoised through a hard/ soft threshold, and reconstructed in the space domain. The denoised patches are aggregated by a collaborative filter. The synthetic aperture radar block matching 3D (SAR-BM3D) [48] introduces a speckle-based variant of 3D block matching; the similarity among the patches is computed by considering the probability distribution of the speckle noise as a distance metric. Furthermore, the hard/soft threshold of the wavelet transformed signal is substituted by a Local Linear Minimum Mean Square Error (LLMMSE) filter. The principal component analysis block matching 3D (PCA-BM3D) [16] improves the stacking operation of 3D block-matching by using shape-adaptive neighbourhoods, which enable its local adaptability to image features. The 3D transformation of each stack to the spectral domain is performed through the PCA [66] and an orthogonal 1D transformation in the third dimension.
Low-rank methods Low-rank approximation computes the denoised image as the solution to a weighted minimisation problem, whose cost function is the Frobenius norm [12,49,60] or the ℓ 1 norm [20], between the input and the target images. The relation between local and non-local information [18] allows us to estimate signal variances, by interpreting the Singular Value Decomposition (SVD, for short) through a bilateral variance estimation. In [54], a high-order SVD is applied to 3D blocks, and the denoised image is achieved with hard thresholding of the decomposed signal. The Weighted Nuclear Norm Minimisation (WNNM) [27] computes the stacks as in the 3D block-matching method, performs the SVD on the stacks and applies a weighted threshold to the singular values, where higher weights correspond to lower singular values, which capture the noisy component of the image. The collaborative filtering of WNNM for the aggregation of the denoised patches is performed as in the 3D block-matching method. The weighted nuclear norm and the histogram preservation [74] are combined in a single constrained optimisation problem, which is solved through the alternating direction method of multipliers [10]. The WNNM is extended to image deblurring with several types of noise [40].

External learning
The K-SVD algorithm [4] represents the signal as a linear combination of an over-complete dictionary of atoms, which are iteratively updated through the SVD of the representation error to better fit the data. A learned simultaneous sparse coding method [43] integrates sparse dictionary learning with non-local self-similarities of natural images. The non-locally centralised sparse representation (NCSR) [19] exploits the non-local redundancies, combined with local sparsity properties, to estimate the coefficients of the sparse representation of the input image. The dictionary is learned by clustering the patches of the image into K clusters through the K-means [42] method and then learning a PCA sub-dictionary for each cluster. This method has been further improved in [69] with a fast version based on a pre-learned dictionary and achieving an improvement of the computational efficiency. The structured sparse model selection over a family of learned orthogonal bases [41] is applied to the deblurring of images with Gaussian noise.

Deep learning for denoising and regression
Deep learning methods for denoising In the Noise2Noise algorithm [36], the network learns to denoise images only considering the noisy data, without any knowledge of the ground truth. The Noise2Void algorithm [33] further expands this idea, and it does not require couples of noisy images for the training. This approach is relevant in biomedical fields, where there are no ground truth images. The Noise2Self method [8] proposes a self-supervised algorithm that does not require any prior information on the input image, estimation on the noise, or ground truth data. The denoising of images [21] is achieved through the extraction of features from the noisy image through a convolutional neural network (CNN), and combining the edge regularisation with the total variation regularisation. The combination of CNN and low-rank representation [25] is applied to detect anomalous pixels in hyperspectral images. The multilevel wavelet convolutional neural network [67] is applied for restoring blurred images affected by Cauchy noise. The block matching Convolutional Neural Network (BM-CNN) [5] integrates a deep learning approach with the 3D blockmatching method; the denoising of the stacks is predicted through a DnCNN [73], which is trained with a data set of 400 images corresponding to more than 250K training samples. A feed-forward Convolutional Neural Network smooths images, independently from the noise level, by exploiting residual learning and batch normalisation. Then, the blocks are aggregated and the image is reconstructed, as in the 3D block-matching algorithm.

Deep learning methods for image-to-image regression
The VGG19 [56] introduces Convolutional Neural Networks (CNN) pushing the depth to 16-19 weight layers, using small convolution filters of 3 × 3 size, with an application to large scale images classification. The Pix2Pix [29] method is a Generative Adversarial Network (GAN), where the generator is a U-net [55], the discriminator is an encoding network [32], and the loss function is based on the binary cross-entropy. The deep convolutional generative adversarial network [53] applies unsupervised learning for image classification and the generation of natural images, exploiting batch normalisation, rectified linear unit activations, and removing fully connected hidden layers.

Method
We propose a novel deep learning framework for the realtime denoising of ultrasound images. Firstly, we introduce the data sets and metrics for the selection of the denoising method that best fits our requirements for ultrasound images (Section 3.1). Then, we optimise the parameters of the selected method to the denoising of ultrasound images (Section 3.2). Finally, we introduce a deep learning (Section 3.3) and HPC (Section 3.4) framework, which achieves real-time denoising. For a detailed discussion on the results, we refer the reader to Section 4.

Data sets and metrics for denoising evaluation
We compare five denoising state-of-the-art methods, which are either specific for speckle noise (e.g. SAR-BM3D) or independent of the type of noise (e.g. WNNM). We consider the Esaote private data set, which contains more than 10K ultrasound images at different resolutions, and is acquired from different (e.g. muscle-skeletal, obstetric, abdominal) anatomical districts. On this data set, we compare the performance of different denoising methods applied to ultrasound images, and analyse the performance of the proposed framework, through the training and the prediction of the learningbased network, with a specialisation to anatomic districts.
As quantitative metrics, we consider the peak-signal-tonoise ratio (PSNR) and the structural similarity index measure (SSIM) for the comparison of the raw input with the target denoised image provided by the proposed framework. Furthermore, we integrate the quantitative metrics with a qualitative discussion on the quality of the denoised images, in terms of blurring and edge preservation.

Tuned-WNNM
According to the results in Section 4, WNNM has been selected as the best denoising method among five state-ofthe-art methods belonging to the spectral, low-rank, and deep learning classes. To improve the quality of the denoised image, we propose a novel approach to the tuning of WNNM parameters, and we refer to this method as tuned-WNNM.
Given a pixel x, the patch P x is the set of pixels in the neighbourhood of x; each pixel of the image has a related patch. The search window is the set of patches selected for searching the closest ones to a reference patch, under a certain metric. The stack, or 3D block, is the set of patches that are similar to a reference patch; these patches are stored in a 3D structure, and the redundancy of the stack is exploited to remove the noise. Within this setting, the tuning of these parameters (i.e. search window, stack, patch size) improves the results of tuned-WNNM with respect to WNNM. Our framework maximises the performance of the denoising method; in fact, the real-time requirement is achieved by the network prediction, while the denoising is applied offline for the generation of the training data set.

Real-time denoising with deep learning
The main requirements of a denoising algorithm for ultrasound images are the magnitude of the removed noise, edge preservation, and real-time computation. The tuned-WNNM satisfies these requirements, except for the execution time, which does not satisfy the real-time need of ultrasound applications. To achieve a real-time computation and to maintain the good results of the tuned-WNNM method in terms of denoising and edges preservation, we identify two strategies. We develop (i) a computationally optimised version of the tuned-WNNM method, exploiting HPC tools, CPUs and GPUs, and low-level programming languages. We design and implement (ii) a deep learning framework that uses the tuned-WNNM as an instance of denoising methods.
The implementation of a computationally optimised, and potentially real-time, version of the tuned-WNNM is a very tough requirement; the main iterative cycle of the algorithm is not parallelisable, due to the dependency of the data among the iterations. Furthermore, the cubic computational cost for the evaluation of the SVD of each block is no further optimisable. The real-time requirement needs a specific hardware-based implementation, and any modification to the method requires a new implementation of the parallel algorithm. This approach needs dedicated and more expensive hardware, which is in contrast with the cheapness of the ultrasound acquisition.

Proposed approach and contributions
The proposed realtime denoising is based on the training of a neural network to learn and replicate the tuned-WNNM behaviour. In the first phase, the network is trained on a data set of ultrasound images of the same district. The data set for the training of the learning method is composed of a set of couples of ultrasound images: the input (i.e. the raw image) and the target (i.e. the image denoised with the tuned-WNNM filter). The ground truth is not available in ultrasound applications; for this reason, the target of the learning method is the output of the tuned-WNNM filter. Then, the trained network provides the denoised output through a real-time prediction of the test images. As the main contribution, the proposed deep learning framework is general in terms of the input data, i.e. type of noise (e.g. speckle, Gaussian noise), resolution (e.g. isotropic, anisotropic) of the input images, acquisition methodology, and the anatomical district. Our deep learning framework is also general in terms of building blocks and parameters: since different methods (e.g. NCSR, SAR-BM3D, custom methods) have good performances, the generality of our framework allows us to use a different denoising algorithm and to exploit its different characteristics in terms of denoising and edge preservation. Alternative denoising methods can be used for different types of noise (e.g. speckle, Gaussian noise) and signals (e.g. 3D images, time-dependent ultrasound videos).
In our approach, we specialise the training phase to specific anatomical districts or types of noise. For instance, we train a specific network for each district, thus obtaining a more precise result when predicting the denoised image, as each network is specialised to the input anatomical features. We also improve the WNNM denoising in terms of real-time computation based on offline training. The real-time computation depends only on the execution time of the network prediction. Furthermore, the offline training can be improved with new data, a priori and/or additional information on the data (e.g. input anatomical district, noisy type/intensity, image resolution, acquisition methodology/protocol). The update of the existing training data set is always performed offline, through the addition of new images after expert validation of the denoising results.
Deep learning architecture To evaluate the proposed framework, we analyse several networks and perform an imageto-image regression; among them (Section 2), we select Pix-2Pix, which guarantees good results in terms of learning. We specialise Pix2Pix to ultrasound images, with two updates: (i) the introduction of a validation data set of the same district of the training data set, which forces the exit condition when the validation error increases, and (ii) the introduction of padding and masking pre-processing operations, which allow us to deal with images of different resolution, without an image resize that would imply a distortion artefact. A comparison between Pix2Pix and Matlab CNN is discussed in Section 4.5.
Training data sets We generate and test different data sets, by varying the number of images for the training phase, and the anatomical district for the prediction phase. The custom Pix2Pix network is trained on four data sets of obstetric images, with respectively: (a) 500, (b) 1500, (c) 3500, and (d) 5000 images. Each data set is composed of the input images (i.e. the raw Esaote data set) and the target images (i.e. the corresponding images denoised with the tuned-WNNM). The validation data set is composed of an additional set of different images (i.e. about 10% of the training data set) of the same district. Then, we evaluate each of the four networks (i.e. the networks trained with a different number of images) with two different test data sets of 50 images each, respectively from the (i) obstetric and (ii) muscular anatomical districts. For each test data set, we compute the quantitative PSNR and SSIM metrics between the prediction of the network and the expected target; furthermore, the experts visually evaluate the prediction results.

HPC framework for learning
We define an HPC implementation of the proposed deep learning framework, taking advantage of a large ultrasound data set with 5K ultrasound images, and of the CINECA-Marconi100 cluster, exploiting both CPUs (IBM POWER9 AC922) and GPUs (NVIDIA Volta V100). Given a training data set, composed of raw images and the corresponding denoised images, we implement a parallel and distributed deep learning framework in TensorFlow2. Then, we define a batch file for the execution of the deep learning framework on the cluster, which specifies the number of nodes, CPUs, GPUs, and memory of the cluster. Through the proposed HPC framework, we train multiple networks with large data sets in a reasonable time for the target medical application, thus increasing the specialisation to anatomical districts, and consequently the accuracy of the deep learning framework. The HPC framework generates a network model that is stored and used for predicting the output results. Furthermore, we can improve the offline training with new data, a priori and/or additional information on the input data (e.g. input anatomical district, noisy type/intensity, image resolution, acquisition methodology/protocol). The training data set can be periodically updated with the denoised images after the expert validation of the denoising results.

Results
We present the results of denoising methods with a specialisation to ultrasound images (Section 4.1), a comparison between baseline and tuned-WNNM (Section 4.2), deep learning (Section 4.3) and HPC (Section 4.4) framework. Finally, we compare the real-time denoising with different neural network architectures (Section 4.5).

Comparison of denoising methods
We evaluate the denoising results on different anatomical districts of the Esaote data set (Figs., 2, 3, 4): WNNM, NCSR, and PCA-BM3D have been judged as the best methods in terms of denoising, and WNNM outperforms all the other methods in terms of edge preservation and enhancement. In particular, WNNM well preserves the edges of the muscular fibres (Fig. 2) and the internal organs (Figs. 3, 4). The output of SAR-BM3D shows a granular effect, which negatively affects the preservation of the anatomical features, and BM-CNN generates artefacts, which are typical of a deep learning approach. According to these results, we select WNNM as the best method for the denoising of ultrasound images. However, we underline that the other methods have their characteristics in terms of denoising and edge preservation, and they could be included in the framework as alternative denoising algorithms. (Table 1) are executed with Matlab R2020a, on a workstation with 2 Intel i9-9900KF CPUs (3.60GHz), 32 GB RAM, and none of these methods achieves real-time computation. In particular, WNNM takes more than three minutes to process a 600 × 485 image, and the fastest method (i.e. SAR-BM3D) takes about one minute; however, real-time computation in an ultrasound environment requires a processing time in the order of a few milliseconds. This result motivates the proposed development of a deep learning framework for the real-time denoising of ultrasound images, further optimised with a HPC framework (Section 3.4).

Tuned-WNNM for US images
We implement the tuned-WNNM through the optimisation of the following parameters. The number of patches is no  more limited by the step value (e.g. 1 patch every 2 or 3 pixels) and we assign a patch for each pixel; this parameter allows us to increase the number of processed patches, thus improving the data redundancy. The block-matching algorithm is now performed every iteration, instead of one every two iterations; the selection of the searching window and the size of the stack are now larger than previous work. These parameters allow us to improve the measure of the similarity among 3D blocks and the accuracy of the denoising method. Furthermore, we specialise the tuned-WNNM method to ultrasound images, by varying the denoising intensity through a parameter that affects the threshold of the singular values of the SVD. Increasing this parameter, the method improves in terms of removed noise, though introducing a low blurring effect. To select the best tuning for denoising intensity, we select the output image that best fits the medical requirements, among three different levels of denoising intensity (Fig. 5). In particular, Fig. 5b shows the best result as a compromise between noise removal, edges preservation, and blurring effect; in fact, it preserves the geometry of the internal tissues, while enhancing the edges of the anatomical structures.

Learning-based denoising
Qualitative results Regarding the deep learning framework, and the large ultrasound data set (Section 3.3), Fig. 6 shows the prediction results of the four networks, when tested with obstetric images (i). The predicted images are very close to the target image in all four cases; the edges and the greyscale values are well reproduced by the network. Furthermore, the predictions do not generate artefacts or spurious patterns. Varying the number of images of the training data set from 500 to 5K (Fig. 6a-d), the predicted images are  Raw data set of an obstetric district and denoised images, visualised in a scan-converted format slightly better than the target denoised images. Nevertheless, the results are good even with a small training data set of 500 images. Figure 7 shows the prediction results of the four networks when tested with muscle-skeletal images (ii). Predicting the output images with the networks trained with obstetric images (Fig. 7a-d), the results are slightly worse than the corresponding case in Fig. 6, even if the predicted images do not show any artefact of pattern repetition. These networks are trained with images from a different (i.e. obstetric) district, with different anatomical features. This result confirms that each district requires a specific network and that a single network for all the districts gives lower quality results. Table 2 reports the quantitative metrics (Section 3.2) computed between the target and the predicted images. The network trained with 5K images (d) tested with obstetric images (i) has a median PSNR and SSIM value of 36.13 and 0.964, respectively, while the same network tested with muscle-skeletal images (ii) has a median PSNR and SSIM value of 26.58 and 0.881. Both the metrics have a very slight improvement when passing from a training set of 500 to a training set of 3500 images, confirming the results of the qualitative analysis. An additional increase of the size of the training data set to 5K images further improves the quantitative results for both the test data sets. Figure 8 shows the box plot of the PSNR and SSIM metrics for four training data sets and two test data sets. Increasing the number of images of the training data set, the range of the metrics tends to decrease; this behaviour has a lower variability on the prediction of the output image. These results confirm that a network specialised in a single anatomical district reaches the best denoising quality. The prediction of muscle-skeletal images from a network trained with obstetric images highly reduces the performance of our framework; in fact, the network learns that replicates not only the denoising algorithm itself but also its adaptation to the anatomic structures and features of each district.

Single versus multiple districts
We compare the prediction results of our framework with three different training data sets. The first two data sets have 500 and 1500 ultrasound images of the same district (i.e. the obstetric one), respectively. The third one is composed of 1500 images of different districts; in particular, we select 500 images from the cardiac, obstetric, and muscle-skeletal districts. Due to the different resolutions of the images, the padding has been applied to obtain the same input resolution for each network. We evaluate the prediction results on four test data sets: the first three are composed of 50 images from the cardiac, obstetric, and muscle-skeletal districts, respectively. The fourth is composed of 50 images randomly selected from the aforementioned three districts. The prediction results ( Fig. 9 and Table 3) show that the networks trained with obstetric images (i.e. the single district networks) give the best results with the obstetric test data set: the predicted image of the single district network shows fewer scattering artefacts than the multiple districts network. Also, the single district networks have better results in terms of quantitative metrics: adding further images from different districts to the training data set worsens the results; in fact, the single district network with 500 obstetric images has a PSNR value of 35.93, while the multiple districts network has a PSNR value of 33.70.
Comparing the results on the other test data sets (e.g. cardiac Fig. 10 and Table 3), the network trained with images of multiple districts has better results than the networks trained with obstetric images only. The multiple district network better generalises on the denoising algorithm, more than on the anatomic features, thus generating fewer artefacts on the prediction. Furthermore, the multiple district network includes 500 cardiac images in its training data set, thus improving the prediction results on this district. As the main conclusion, a dedicated network for each anatomic district is the best solution for the prediction of the denoised ultrasound images of each specific district, if a sufficiently large data set is available for the training.

Execution time and computational cost
To test the training phase of the deep learning framework (Section 3.3) on the HPC framework (Section 3.4), we exploit 8 nodes, each one composed of 32 cores and 4 accelerators, for a theoretical computational performance of 260 TFLOPS, and 220 GB of memory per node. The parallel implementation of the deep learning framework and the high hardware performance reduce the computation time of the training phase by at least 100 orders less than a serial implementation on a standard workstation.
The execution time of the prediction is crucial for the real-time implementation of our framework. We test the denoising prediction on GPU-based hardware, which replicates the hardware of an ultrasound scanner currently in use. Given a set of ultrasound input images from different districts, the average execution time is 25 milliseconds; this Table 2 With reference to the four training data sets and the two test data sets (i.e. obstetric: Ob., and muscle-skeletal: Msk.) described in Section 3.3, we report the PSNR and SSIM metrics computed between the target and the prediction images, as median value among the 50 . 9 Prediction results of the obstetric district, with the networks trained with 500 (a) and 1500 (b) images from the obstetric district, and 1500 images from multiple districts (500 obstetric, 500 cardiac, 500 muscle-skeletal images) result confirms that we achieve the real-time computation target, required by the industrial constraint. We underline that the input resolution of the network is 600 × 600, which is reached through the zero-padding of each input image. The computational cost of the prediction depends on the resolution of the input image and on the architecture of the network: in particular, the computational cost of a convolution operation is O(r∕s r ⋅ c∕s c ) ⋅ (f r ⋅ f c ) ⋅ f ; in our application, the input image has a resolution of r = c = 600, the kernel-filter size on rows and columns is f r = f c = 4, the stride on rows and columns is s r = s c = 2, we use 10 convolution and 10 deconvolution operators, and a number of kernel-filters from 32 to 512.

Denoising on different learning architectures
We compare the prediction results of two different networks: Pix2Pix and the (Matlab) CNN, as part of our deep learning framework. Figure 11 shows that Pix2Pix has better results than the CNN, in terms of blurring reduction, noise removal, and edge preservation. We also compare the quantitative metrics between the target images and the predicted images, on the test data set of 50 ultrasound images of the obstetric anatomic district (Section 3.3). Pix2Pix has a PSNR average value of 36.07, and an SSIM average value of 0.878, while the CNN has a PSNR average value of 25.69, and an SSIM average value of 0.651. This result underlines that Pix2Pix outperforms the CNN as network architecture for our deep learning framework.

Discussion
Several ultrasound machines manufactured by main competitors (e.g. Esaote, Philips) are equipped with GPU cards [1, 2] Furthermore, some recent denoising methods for ultrasound images are developed on GPUs [9,22], which are also used for denoising. Furthermore, the application of GPUs to image processing for future medical ultrasound imaging systems [59] presents the advantages of GPUs over CPUs in terms of performance, power consumption, and cost.
Denoising of ultrasound images is relevant both for postprocessing and visual evaluation by medical experts. Despite some relevant works consider raw ultrasound images and videos for cardiac segmentation [38,47], several works show the benefits of denoising for segmentation [62,70,75], feature extraction [28], classification [57,65], super-resolution [31], registration [17], and texture analysis [52]. Furthermore, main ultrasound machine manufacturers include a denoising filter in their scanners [1,3]. In ultrasound denoising, the main goal is to achieve the best compromise between noise removal, features preservation, and real-time execution. The use of deep learning allows us to reach real-time performance, which is required by the clinical practice while preserving the denoising results of state-of-the-art methods, which are time-consuming. In our framework, deep learning  preserves the quality of WNNM and reaches its results in real-time. Furthermore, the deep learning for the real-time processing of ultrasound images has been applied in several works [37,58]. Our method reaches real-time performance and highquality denoising results, through a learning-based approach. In contrast, fast handcrafted methods [26] have lower results in terms of noise removal and edges enhancement; GPUbased methods [23] have higher hardware requirements than our method; other denoising methods [68] have good results in terms of noise removal, but they cannot reach a real-time implementation, due to high computational cost. Our framework also allows us to tune the denoising algorithm to obtain the best denoising results, as this tuning only affects the training phase, while the real-time computation of the denoised image is performed through the prediction of the network.

Conclusions and future work
We have presented a novel deep learning framework for real-time denoising of ultrasound images, which is general enough to be applied to different anatomical districts and noise levels. As the main contribution, the proposed realtime denoising of ultrasound images is general in terms of the input data, i.e. type of noise (e.g. speckle, Gaussian noise), the resolution and the dimensionality of the input images (e.g. isotropic/anisotropic, 2D/3D images), the acquisition methodology, and the anatomical district. We also mention its generality in terms of building blocks and parameters of the deep learning framework, i.e. the denoising algorithms (e.g. WNNM, SAR-BM3D) and the deep learning architecture (e.g. Pix2Pix, VGG19).
As future work, we plan to apply our framework to data acquired with different methodologies (e.g. 3D ultrasound, MRI), also taking into account time-dependent data (e.g. ultrasound videos). Finally, the industrial and clinical validations of the proposed framework are under development, by comparing our results with tools currently used in medical clinics.

A.1 Quantitative comparison
We compare the five selected denoising methods (Section 4) on synthetic images, by adding speckle noise with different levels of noise intensity: given a noisy image Y = X + NX, where X is the normalised ground truth image, we define the artificial multiplicative noise N(x) = √ 12 u , where u ∼ U(−0.5, 0.5) , U is a uniform distribution, σ is the noise intensity, and x is a pixel of the image.
The SIPI data set [64] is composed of 44 ground truth images of different sizes, organised in different classes (e.g. humans, landscapes). We evaluate the efficiency of the denoising methods: WNNM has very good results in terms of noise removal, edge preservation (e.g. vehicles shape (Fig. 12), and hat feathers (Fig. 13). SAR-BM3D has the best results in terms of noise removal; however, it does not correctly preserve the grey-scale values (e.g. boy's sleeve in Fig. 13) and it generates a blurred effect (e.g. grass and bushes in Fig. 12). PCA-BM3D and NCSR show minor Fig. 11 (a) Input, (b) target, (c) our prediction based on Pix-2Pix, and (d) CNN prediction for the obstetric district preservation of edges and details than WNNM (e.g. boy's face in Fig. 13). Finally, BM-CNN is not able to correctly remove the noise; this result underlines the importance of the training data set (e.g. the type and the intensity of the applied noise) when using a deep learning approach, and the necessity of using datsa-specific networks, instead of a generic-purpose one.
Concerning the metrics introduced in Section 3.1, Table 4 summarises the results of the five denoising methods on the SIPI data set; we compute the average value of each metric (i.e. PSNR and SSIM) among 44 images of the data set, and report the average values when varying the intensity of the speckle noise. SAR-BM3D has the best results under these metrics, outperforming all the other methods. The NCSR, WNNM, and PCA-BM3D methods have good and similar results in terms of PSNR and SSIM indices. These four methods show a small degradation of the metrics values when increasing the noise intensity; this result is significant Fig. 12 Input (SIPI data set, van image), noisy (speckle noise intensity σ = 0.10), and denoised images. For error metrics, we refer the reader to Table 4 Fig. 13 Input (SIPI data set, man image), noisy (speckle noise intensity σ = 0.20), and denoised images. For error metrics, we refer the reader to Table 4  for ultrasound images, which generally have a different noise intensity, according to the anatomical district, the type of probe, and the data acquisition modality. Finally, BM-CNN shows a higher degradation of the PSNR and SSIM values, when increasing the noise intensity. The quantitative analysis is useful to compare methods with numerical measures, instead of performing only a visual evaluation. However, the main comparison among methods is the qualitative evaluation performed by the medical experts on ultrasound images, through the evaluation of the speckle noise removal and the preservation of anatomical features. We underline that, even if SAR-BM3D has better results than WNNM on synthetic images, WNNM has better performance on ultrasound images. Furthermore, our framework is general enough to use different denoising methods; two different learning networks can be trained, with WNNM and SAR-BM3D, to offer the physician the comparison between two denoising results.

A.2 Tuned-WNNM
Comparing the baseline WNNM with the tuned-WNNM on synthetic images, we improve the denoising quality (Fig. 14) in terms of quantitative metrics; in fact, the output of WNNM has a PSNR value of 26.67, while the output of tuned-WNNM has a PSNR value of 26.74. Nevertheless, the execution time of WNNM is 94 seconds, while the execution time of tuned-WNNM is 260 seconds. We also compare tuned-WNNM and WNNM on the SIPI data set ( Table 5).
The aggregated results show that tuned-WNNM has slightly better performance with low noise intensity, while the results improve when the speckle noise is higher. For example, WNNM has an average PSNR value of 22.32 with a speckle noise of intensity σ = 0.3, while tuned-WNNM has a PSNR value of 22.61.