Super-resolution of 2D ultrasound images and videos

Abstract This paper proposes a novel deep-learning framework for super-resolution ultrasound images and videos in terms of spatial resolution and line reconstruction. To this end, we up-sample the acquired low-resolution image through a vision-based interpolation method; then, we train a learning-based model to improve the quality of the up-sampling. We qualitatively and quantitatively test our model on different anatomical districts (e.g., cardiac, obstetric) images and with different up-sampling resolutions (i.e., 2X, 4X). Our method improves the PSNR median value with respect to SOTA methods of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.7\%$$\end{document}1.7% on obstetric 2X raw images, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$6.1\%$$\end{document}6.1% on cardiac 2X raw images, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4.4\%$$\end{document}4.4% on abdominal raw 4X images; it also improves the number of pixels with a low prediction error of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$9.0\%$$\end{document}9.0% on obstetric 4X raw images, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$5.2\%$$\end{document}5.2% on cardiac 4X raw images, and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$6.2\%$$\end{document}6.2% on abdominal 4X raw images. The proposed method is then applied to the spatial super-resolution of 2D videos, by optimising the sampling of lines acquired by the probe in terms of the acquisition frequency. Our method specialises trained networks to predict the high-resolution target through the design of the network architecture and the loss function, taking into account the anatomical district and the up-sampling factor and exploiting a large ultrasound data set. The use of deep learning on large data sets overcomes the limitations of vision-based algorithms that are general and do not encode the characteristics of the data. Furthermore, the data set can be enriched with images selected by medical experts to further specialise the individual networks. Through learning and high-performance computing, the proposed super-resolution is specialised to different anatomical districts by training multiple networks. Furthermore, the computational demand is shifted to centralised hardware resources with a real-time execution of the network’s prediction on local devices. Graphical abstract


Introduction
Ultrasound (US, for short) acquisition applies high-frequency sound waves to visualise soft tissues and internal organs, and support medical diagnosis for muscle-skeletal, cardiac, and obstetrical diseases.US acquisition has many advantages with respect to magnetic resonance and tomographies, such as its portability, cheapness, and non-invasiveness.Furthermore, its real-time acquisition provides instantaneous feedback to the physician, e.g., during regional anaesthe-We define the non-acquired beam lines as the intermediate lines to those acquired by the probe.These intermediate lines are not acquired to increase the acquisition time frequency but are approximated by the super-resolution scheme.Applying our approach to US videos with a low spatial resolution and a high frequency (e.g., for the cardiac district), we can generate high-frequency 2D US video with an increased spatial resolution of each frame, thus overcoming the main limits of current US probes, whose spatial resolution decreases as the acquisition frequency increases.Acquiring a 2D video with a low spatial frequency of the single image (i.e., each frame), our method reconstructs the spatially high-resolution video in real-time.
First, we compare several state-of-the-art up-sampling algorithms (Section 2) and identify the best method in terms of quantitative metrics and visual evaluation.Then, we train a neural network to improve the results of the up-sampling to match the target image (i.e., the high-resolution image).Our network does not perform the interpolation of the missing lines; in fact, this task is already performed by up-sampling.In contrast, our network learns how to transform the upsampled lines into the target lines.To improve the quality of the up-sampling, we train multiple networks, each one specialised to the input anatomical district (e.g., cardiac, abdominal) and its low-resolution image (e.g., 0.5X, 0.25X).This specialisation improves the quality of the up-sampling since we specialise the network to a specific prediction.The execution time of the super-resolution depends on the up-sampling and the network prediction; the prediction is achieved in real-time on standard medical hardware.We summarise the proposed framework (Fig. 1), where we generate the data sets within the pre-processing phase, the learningbased models within the training phase, and the real-time super-resolution prediction within the test phase.
As the main contribution (Section 3), we propose a novel learning-based architecture that accounts for convolutional layers and rectified linear unit (ReLU) activation functions (Fig. 2) and improves the Wide Activation for Efficient and Accurate Image Super-Resolution (WDSR) [45].The kernel size is selected according to the dimension of the lowresolution image to guarantee that at least two original lines (i.e., two lines that are acquired by the probe) are always included in the convolution operation.Then, we modify the loss function to improve the visual accuracy of the prediction.Our logarithmic-based loss includes only up-sampled lines, excluding lines acquired by the probe.
The proposed approach is general in terms of the building blocks of the framework; in fact, we can select different up-sampling algorithms, e.g., Single Image Super Resolution (SISR) [32], Enhanced Super Resolution Generative Adversarial Network (ESRGAN) [42] and deep learning architectures, e.g., Pix2Pix [18] and VGG19 [37].As exper-Fig. 1 Proposed framework (Section 1): training of the learning-based model and spatial up-sampling of US videos.A high-resolution image is down-sampled by removing one line (highlighted in red) each two (0.5X) or four (0.25X) and then up-sampled through the selected interpolation algorithm.Up-sampled images and the corresponding high-resolution images are the input and target to train the neural network, respectively.For the test phase, low-resolution images are acquired during ultrasound acquisition (i.e., images with a reduced number of acquired beam lines); these images are up-sampled through the interpolation algorithm and the neural network predicts the final output that is expected to be similar to the unknown high-resolution target imental validation (Section 4), we perform a quantitative and qualitative evaluation of our framework on a large collection of US images acquired from different anatomical (e.g., muscle-skeletal, obstetric, abdominal) districts.Then, we apply our method to the spatial super-resolution of US 2D US videos, and we evaluate the effects of denoising the raw images as pre-processing of our framework.Finally, we present a discussion on main outcomes (Section 5), conclusions, and future work (Section 6).Trained models and training/test code are available at https://github.com/cammarasana123/US-SuperResolution.

Learning-based US super-resolution
The main novelties of the enhanced deep super-resolution network [24] are a simplification of the conventional residual network architectures and a multi-scale super-resolution network that reduces the model size.Exploiting the sparsity of the signal in the Fourier domain, the interpolation of missing data [43] allows reconstructing the high-resolution ultrasound (HR US) image with a low computational cost.A super-resolution generative adversarial network (SRGAN) [23] applies a deep residual network with skip-connection and a perceptual loss between generated and target images.The reduction of artefacts of the previous method is addressed by the Enhanced SRGAN [42], which improves the network architecture, the adversarial and the perceptual loss, removes the batch normalisation layer, and applies the residual scaling and smaller initialisation values.
The perceptual quality of ESRGAN is improved by the ESRGAN+ method [35] through a novel Residual-in-Residual Dense Residual block, which increases the network capacity without affecting its complexity.The application of the SRGAN to US images [9] preserves both the anatomical structures and the speckle noise pattern, thus improving the perceptual quality of the upsampled images.Dilated convolution [27] extracts the internal recurrence information from the test image; this method upsamples low-resolution (LR) images when LR-HR examples are reduced.Fully convolutional U-net [40] obtains high-resolution vascular images from high-density contrast-enhanced US signals.
In [39], the deep learning method exploiting feature extraction blocks, repeating blocks, and upsampling layers apply an up-sampling factor in the range 2-8.A Self-supervised CycleGAN [26] only requires the LR US image, and generates perceptually consistent up-sampling results.Combining CycleGAN, two-stage GAN, and the zero-shot super resolution [13], it is possible to obtain super-resolution images with low blurring artefacts.

Vision-based US super-resolution
Learning-based methods suffer from artefacts and blurring when dealing with noisy signals.Several vision-based methods have been proposed, through the years.The interpolating up-sampling with cubic kernels [20] offers high accuracy with low computational cost, through appropriate boundary conditions and constraints on the kernel functions.In [3], a novel deconvolutionbased method applies the maximum a posteriori estimation to the restoration of the tissue response and is validated with several tissue-mimicking phantoms with specific scatterer concentrations.The Alternating Direction Method of Multipliers [30] is applied to the super-resolution of US images including deblurring and denoising [28] through a combination of 1 and 2 minimisation.In [44], a deconvolution method models the envelope radiofrequency and point spread function is robust to noise and does not require the knowledge of the centre frequency of the acquired signal.Assuming a Gaussian distribution for both the unknown signal to be restored and the point spread function, in [46] the reconstructed image is built through a posterior model with hybrid Gibbs sampling [15].The properties of the decimation matrix in the Fourier domain [47] are exploited to solve the super-resolution problem with a p -norm regulariser, with p ∈ [1,2].The envelope of radio frequency signal [21] applies repetitive data in the non-local neighbourhood of samples.

Device-based US super-resolution
The second harmonic image [38] contains less noise and blur than the first harmonic image.Furthermore, the lateral resolution is increased, as the harmonic pulse is auto-focused because the higher harmonics are generated in the centre of the beam.Then, the image super-resolution is achieved by combining the first and second harmonic images.Both spatial and temporal deconvolution operations [25] are achieved by accounting for the transmit and receive processes, electrical transducer characteristics, and transmit focusing laws.Combining phase-contrast imaging, angular spectral decomposition, and a super-resolution reconstruction technique [10], it is possible to recover the location and dimensions of objects smaller than the imaging wavelength.The reconstruction through generalised Tikhonov regularisation [22] is evaluated as a function of transmit-receive bandwidth and a focal number of the transducer, by comparing the results with traditional B-mode imaging.The Time-domain Optimized Nearfield Estimator [41] assumes an observation model based on the superposition of spatial responses; then, a maximum aposteriori estimation finds the distribution and amplitude of hypothetical targets that match the observed data with minimal target energy.As a further improvement, the Diffuse Time-domain Optimised Near-field Estimator [14] represents each hypothetical target in the system model as a diffuse region of targets rather than a single discrete target, thus inducing a better signal approximation.The cellular microscopy technique of multi-focal imaging [12] is applied to localise the unique position of the scatterer of the signal; three foci receive multiple overlapping curves, and a maximum likelihood estimation allows the identification of the source of the scatter.

Multi-frame US super-resolution
The Bilinear Deformable Block Matching [4], which is a registration method that accounts for the complex and deformable motion of soft tissues, is applied to reconstruct the HR image by exploiting the shifting property of the Fourier transform and the aliasing relationship between the continuous Fourier transform of the HR image and the discrete Fourier transform of LR images [29].The use of deep learning for motion estimation among different frames [2] reduces the effect of noise and artefacts and reconstructs HR images from a sequence of LR images.The modelling of the spatial correlation of the speckle noise [11] is applied to standard reconstruction methods, with tissue-mimicking phantom and co-registered multi-images.

Proposed super-resolution of US signals
The ultrasonic waves are emitted by the probe and straightly penetrate the body structures along their path; when they pass through adjacent parts of the body with a different acoustic impedance, a fraction of the ultrasound pulse returns as a reflected wave, generating an echo that returns to the probe, while the rest of the wave continues to penetrate along the beam to greater tissue depths.The amplitude of the reflected echo is proportional to the difference in acoustic impedance between two adjacent media.For example, interfaces between soft tissue and dense organs (e.g., bones) generate very strong echoes due to a large acoustic impedance gradient.The acoustic impedance is a physical property of a medium defined as the density of the medium times the velocity of the wave propagation.Human body tissues have different acoustic impedances: for example, air-containing organs (such as the lung) have the lowest acoustic impedance, while dense organs have a higher acoustic impedance.
The echo signals are processed and combined to generate the underlying image, which has a resolution of l ×d, where l is the number of beamlines (i.e., the lateral resolution), and d is the depth of the acquisition of each beam line (i.e., the axial resolution).Axial resolution refers to the ability to discern two separate objects that are longitudinally adjacent to each other in the ultrasound image; lateral resolution refers to the ability to discern two separate objects that are adjacent to each other; the lateral resolution is usually lower than the axial resolution in ultrasound.The resolution of the image in terms of lateral direction (i.e., the direction perpendicular to the US propagation along the beam line) is primarily determined by the width of the ultrasound beam and the number of elements (i.e., the piezoelectric crystals) that are activated to generate the US waves.Current probes vary the number of beamlines acquired by activating/deactivating piezoelectric crystals thus reducing lateral resolution and image acquisition time.The axial resolution can be varied by changing the length and frequency of the pulses, which affect the penetration of the ultrasound wave.
In this context, we focus on lateral low-resolution image acquisition, reduce the acquisition time, and subsequently reconstruct the high-resolution image without losing information in terms of data depth.This aspect is relevant to improve the quality of the US image, its visual interpretation by the physician, and post-processing steps, e.g., as classification [1], segmentation [5], and morphological analysis [36].

Proposed super-resolution of US images
Our framework is composed of two steps: first, we up-sample the lowresolution image through an interpolating method.After the comparison of state-of-the-art methods (Section 4.2), we select Cubic Convolution as the up-sampling algorithm.Then, we apply a learning-based network to improve the visual accuracy of the up-sampling.
For the experimental part, we consider the Esaote data set, which contains more than 10K US images at different resolutions, and is acquired from different anatomical districts (e.g., obstetric, cardiac).Given a high-resolution image (i.e., the target) acquired by the probe, we build the corresponding low-resolution image by removing one line each 2 (0.5X) or 4 (0.25X).This approach is consistent with the acquisi-tion of the US image, where the probe can acquire at the full, half, or a quarter of the maximum number of beamlines, depending on the activation of the piezoelectric crystals.We up-sample the low-resolution images through Cubic Convolution at 2X (applied to 0.5X low-resolution) or 4X (applied to 0.25X low-resolution).Then, we use the couples of upsampled and target high-resolution images to analyse the proposed framework, through the training and the prediction of the learning-based network, with a specialisation in anatomic districts.
We generate a separate training data set of 1.5K images for each anatomical district and two different up-sampling resolutions of 2X and 4X.Then, the same images are denoised through a low-rank denoising algorithm [8] to build the training data set of 1.5K denoised images for each anatomical district.In total, we train 12 networks (i.e., 3 anatomical districts, 2 up-sampling factors, 2 (raw/denoised) images), each with 1.5K images as a training data set.In addition, for each anatomical district, up-sampling factor, and raw/denoised images we generate a validation data set of 400 images and a test data set of 200 images, using each image in only one of the three data sets.
Our approach requires the interpolation of the missing rows to the up-sampling method, while the learning model deals with the prediction of the target values from the interpolated values of the up-sampling method.Given two images A and B both of size m × n, as quantitative metrics we consider the peak-signal-to-noise ratio P S N R(A, B) = 10 log 10 (max(A)) 2   M SE(A,B) , where we define the mean squared error where μ(•) is the mean of (•), σ (•) is the standard deviation of (•), σ AB is the covariance between A and B, the positive constants C 1 , C 2 and C 3 are used to avoid a null denominator.We also consider the mean absolute error , and the pointwise absolute error image |A − B| for the comparison of the high-resolution target with both the up-sampled image and the prediction of the network.We also compare the histogram of the absolute value of the prediction error to analyse the number of pixels whose error is lower than a certain threshold.

Deep learning network
We select WDSR [45], an architecture that exploits residual blocks since it improves the prediction of images where the difference between the input and the target is small.We propose a customised version of this network: custom-WDSR.In particular, our network architecture is a variant of WDSR-A, where the expansion of the features before the rectified linear unit (ReLU) activation allows more information to pass through while preserving the non-linearity of the network.After the normalisation of the data, we apply a 2D convolution and a weighted normalisation that improves the conditioning of the optimisation problem and thus the convergence.Then, we apply 8 residual blocks with wide activation where each residual block is composed of two convolution layers with ReLU activation and a final 2D convolution with a weighted normalisation layer; finally, we combine residual blocks and convolution layers and apply the denormalisation.The kernel filter size depends on the up-sampling factor: (3 × 3) in 2X up-sampling, and (5 × 5) in 4X up-sampling.The convolution layer does not need to perform the interpolation of the missing values, since this operation has already been performed by the up-sampling algorithm.For this reason, we did not implement the WDSR-B network which adds a linear low-rank convolution and neither pixel shuffling for the deconvolution operation.With this setting, the total number of trained parameters is 889K for 2X network and 253K for 4X network.
Given an y = L×D target image, and its approximation ŷ, our loss function is defined as where s controls the number of lines acquired by the sensor (e.g., s = 4 when 4X up-sampling is applied) and neglecting their contribution to the training loss; = 10 −4 avoids a null error for the logarithmic loss, and k = 5 determines the curvature of the logarithmic loss function.We enhance the pixels where the loss is less than 5 on the 0-255 range, to improve the visual similarity between the prediction and the target image.We underline that data are normalised in the range 0-1, and consequently the k value is set to 5/255 ≈ 0.019.The value of is selected sufficiently smaller than 1/255 ≈ 4 • 10 −3 , which is the normalisation of the smallest possible difference value between two pixels; we have experimentally set = 1 • 10 −4 .The size of the kernel of the convolution filter depends on the up-sampling factor; in the case of a 2X up-sampling, we apply a 3 × 3 filter; for a 4X up-sampling, we apply a 5×5 filter.This choice allows us to include at least two lines acquired by the probe, in the convolution operator.Finally, we set the number of layers to 16 and the number of kernels to 10.The learning rate iteratively decreases, up to 10 −6 , and the number of epochs is set to 200.The input and output layers of the network are #batch × L × D size.

Experimental results
We discuss the results of the proposed super-resolution of 2D US images (Section 4.1) and compare our results with previous work (Section 4.2); we present the results with 2D US videos (Section 4.3) and noisy images (Section 4.4), and discuss the execution time (Section 4.5).

Proposed super-resolution of US images
We train each learning-based network (custom-WDSR) with 1.5K images, where the input is the outcome of the selected up-sampling method (i.e., Cubic convolution), and the target is the original high-resolution image.Indeed, input and target images have the same resolution, as the reconstruction of the missing lines has been already performed by Cubic convolution.Figures 3, 4, and 5 show the results of the network prediction, compared with the input and the target images.Target images correspond to spatial high-resolution images; input images are the outcome of the up-sampling interpolation, which is applied to spatial low-resolution images (i.e., the down-sampling along the lateral direction of highresolution images); prediction images represent the output of the neural network.
Our framework visually improves the results, in terms of blurring and artefacts.This result is more evident in the magnification of the ear of the foetus (Fig. 3), the mitral valve (Fig. 4), and the mass edges (Fig. 5). Figure 6 shows the error image of the three anatomical districts with both 2X and 4X up-sampling factors, with the maximum error in the scale 0 − 255.The error is more evident in the contours of the anatomical structures; moreover, the abdominal district shows a smaller error than the cardiac and obstetric ones.We underline that the view for each image is scaled to its maximum, to improve the visualisation of the error.
Figure 7(a-b-c, left) shows the box plot of the statistics of the PSNR on three different anatomical districts, comparing the target images with the prediction and the cubic convolution, respectively.The metrics are computed on a data set of 200 images of the same district and with the same up-sampling factor.We report that the PSNR median value improves of 1.7% on obstetric 2X raw images, 6.1% on cardiac 2X raw images, and 4.4% on abdominal raw 4X images.
Figure 7(a-b-c, right) shows the histogram of the absolute value of the error with respect to the target image, of the prediction and Cubic convolution results, respectively.The histograms show the number of pixels where the prediction error is lower than 5 (i.e., the first bin of the histogram), which Fig. 4 Prediction on raw images of the cardiac district: 2X up-sampling (first line); 4X up-sampling (second line).See also Fig. 3 Fig. 5 Prediction on the raw images of the abdominal district: 2X up-sampling (first line); 4X up-sampling (second line).See also Fig. 3 means very similar to the target when visually analysing the images.From the Cubic convolution to the predicted images, this value increases of 9.0% on obstetric 4X raw images, 5.2% on cardiac 4X raw images, and 6.2% on abdominal 4X raw images.
Figure 8 shows the box plot of the SSIM (a-b-c, left) and MAE (a-b-c, right) quantitative metrics, as performed for PSNR metric.Also, these metrics show that our method improves the results of Cubic convolution both in terms of average value and variability.For example, the SSIM median value improves of 2.5% on obstetric 4X images and the MAE median value improves of 4.7% on cardiac 2X images.
The analysis of the absolute value of the difference between the input and the prediction of the network (Fig. 9) shows that the alteration of our prediction to the pixel val-Fig.6 With reference to Figs. (3,4,5), we show the error image of our method with respect to the target image with both 2X and 4X upsampling factors: obstetric district (first row), cardiac district (second row), and abdominal district (third row).For each image, we report the maximum error in the scale 0 − 255 ues ranges from 0 to a maximum absolute value of 20, mainly located on the edges of the anatomical structures; furthermore, the black uniform areas are less affected by the prediction.In terms of the distance between the input and the prediction, we do not observe a significant difference among anatomical districts and between 2X and 4X up-sampling.
We also verify the robustness of our method on images at different brightness.Characterising the brightness of an image as the average value of all pixels, we test images with high and low brightness on different anatomical districts and up-sampling factors.

Comparison with previous work
We address both the comparison among state-of-the-art algorithms that are used for the selection of the up-sampling method of our framework and the comparison of our results with previous work.Among up-sampling STAR methods, we test four methods belonging to different classes: Cubic Convolution [20], a kernel-based interpolating method; Enhanced Deep Residual Networks -EDSR [24], a learning-based method trained on generic images; Enhanced Super-Resolution Generative Adversarial Network Plus -ESRGAN+ [35], a learning-based GAN method, specialised on US images with a dedicated training; Single Image Super Resolution -SISR [32], an up-sampling method which exploits sparse representations.We evaluate the up-sampling results of the selected methods on different anatomical districts and resolutions: obstetric district with 0.5X downsampling (Fig. 12); abdominal district with 0.25X down-   12 Comparison of up-sampling methods vs. our method on the obstetric district: 0.5X low-resolution and 2X up-sampling.See also Table 1 123 Fig. 13 Comparison of up-sampling methods vs. our method on the abdominal district: 0.25X low-resolution and 4X up-sampling.See also Table 1 sampling (Fig. 13).Figure 14 shows the error image between target and SOTA super-resolution on both 2X and 4X upsampling, with the maximum error value in the range 0−255: Cubic convolution has visually the best results in terms of approximation error.Furthermore, our method improves the error image results with respect to Cubic convolution, improving the approximation of the target image, including the maximum error.All the error images of each up-sampling factor are represented with the same colour scale to better visualise the differences among the methods.
Tables 1, 2 and 3 summarise the comparison with the PSNR, SSIM, and MAE metrics on a test data set of 200 images.Cubic convolution has a mean PSNR value of 36.52 and 42.17 for 2X and 4X upsampling, respectively.According to these results, we select Cubic convolution as the best method for the up-sampling of US images.This method interpolates the missing lines, without generating artefacts.In comparison, our method improves the results of previous work (Figs. 12 and 13, Table 1), with a mean PSNR value of 37.00 and 44.35 for 2X and 4X super-resolution, respectively.Finally, we underline that 4X super-resolution on the abdominal district has better results than 2X super-resolution on the obstetric district, due to the complexity and variety of each anatomic district data set.

Proposed super-resolution of US videos
Applying our approach to US videos with a low spatial resolution and a high frequency (e.g., for the cardiac district), we can generate high-frequency 2D US video with an increased spatial resolution of each frame, thus overcoming the main limits of current US probes, whose spatial resolution decreases as the acquisition frequency increases.The relationship between image resolution and video frequency f is given by f = c/(2 • d •l), where c is the speed of sound.The acquisition of low-resolution US images allows the physician to increase the acquisition frequency.The probe acquires a reduced number of lines: we refer to 0.5X and 0.25X low-resolution images, as l/2 × d and l/4 × d resolution, respectively.We refer the reader to the uploaded video for the experimental tests on the spatial super-resolution of 2D US videos (see URL below).In the video, the input signal is a 2D US video at full resolution L × D × T with L lines, D depth, and T frames.We down-sample each image at L/2 or L/4, and apply our framework for the spatial super-resolution, to reconstruct the full-resolution 2D video.Video URL: https://www.dropbox.com/s/p42pzxxvgf9gacl/SuperResolution-US.mp4?dl=0.

Denoising and super-resolution
To evaluate the effect of denoising for the super-resolution of US images, we apply to input raw images a learning-based low-rank denoising which allows us to select a soft intensity of the smoothing.This approach generates denoised images that are visually similar to raw images, and simultaneously more uniform.Then, we generate down-sampled images (0.5X and 0.25X) and apply the Cubic convolution up-sampling.These couples of images (i.e., denoised at full resolution and up-sampled) are used to train the learningbased network (Section 3).With this approach, we verify the performance of both the up-sampling algorithm and our learning-based prediction when applied to input denoised images.
Figure 15 shows the results of the prediction of the network, compared with the input and the target denoised images of the obstetric district.Our framework visually improves the results, in terms of blurring and artefacts.Figure 16 shows the error image of our prediction with respect to the target denoised image, for both 2X and 4X up-sampling.The error is mainly distributed on the edges of the anatomical structure.Furthermore, the maximum error of the 2X up-sampling is   6 in the range of 0 − 255, showing us that our method accurately predicts the target if soft denoising is applied before up-sampling.Figure 17 (left) shows the box plot of the quantitative metrics, comparing the target images with the prediction and the Cubic convolution, respectively.The PSNR metric is computed on a data set of 200 images, belonging to the same district, and with the same up-sampling factor.Analysing the obstetric anatomical district and concerning the corresponding raw images (Fig. 7 (a, left)), the denoising allows the network to significantly improve the results of the upsampling and the prediction.In particular, comparing the target images with the predicted images, the median PSNR value of obstetric 2X denoised images is 51.8, compared to the median PSNR value of obstetric 2X raw images which is 36.9.
Figure 17 (right) shows the histogram of the absolute value of the error with respect to the target, of the prediction and Cubic convolution respectively.This result shows that our framework increase of 1.7% and 14% (2X and 4X, respectively) the number of pixels where the prediction error is lower than 5, which is very similar to the target when visually analysing the images, and improved with respect to the learning framework applied to raw images.According to Fig. 18, our method improves the accuracy of Cubic convolution.For example, the SSIM increases of 1.3% on cardiac 2X and the MAE increases of 8.2% on abdominal 4X.

Execution time and computational cost
We define an HPC implementation of the proposed framework on the CINECA-Marconi100 cluster, exploiting both CPUs (IBM POWER9 AC922) and GPUs (NVIDIA Volta V100).We design a parallel and distributed implementation in TensorFlow2, and we train multiple networks with large data sets for the target medical application.To test the training phase of the learning-based networks in the HPC environment, we exploit 8 nodes, each one composed of 32 cores and 4 accelerators, for a theoretical computational performance of 260 TFLOPS, and 220 GB of memory per node.The parallel implementation of the deep learning framework and the high hardware performance reduce the computation time of the training phase by at least 100 orders less than a serial implementation on a standard workstation.Figure 19 shows the training loss and validation PSNR.Both metrics show convergence property within 100 epochs iteration.In particular, the validation PSNR goes from a value of 41 to a value of 58 after 100 epochs.The computational cost of the prediction depends on the resolution of the input image and the architecture of the network: in particular, the computational cost of a convolution operation is O(r /s r • c/s c ) • ( f r • f c ) • f ; in our application, the input images have variable resolutions, with a maximum value of r = c = 600, the kernel-filter size on rows and columns is f r = f c = 3 on 2X applications Fig. 16 Concerning Fig. 15, we show the error image of our method with respect to the target image with both 2X and 4X up-sampling factors on obstetric denoised images.For each image, we report the maximum error in the scale 0 − 255 and f r = f c = 5 on 4X applications, the stride on rows and columns is s r = s c = 1, we use 16 convolution operators and 10 kernel filters.
We test the prediction on GPU-based hardware, which replicates the hardware of a US scanner currently in use.Given a set of US input images from different districts at different resolutions, the average execution time is 8 milliseconds.Finally, the denoise pre-processing can be performed in real-time through a learning-based method [7].

Discussion
SOTA super-resolution methods approximate the unknown values with deep-learning models that take advantage of large data sets or interpolating models that account for the neighbouring points through kernel functions.Learningbased models tend to generate artefacts while interpolating algorithms are general-purpose models that may be less accurate on anatomical districts with complex geometries and may not be robust to noise images.Our method combines the two approaches: first, we up-sample the low-resolution image through an interpolating method; then, we apply a learning-based network to improve the visual accuracy of the up-sampling on the specific anatomical district without generating artefacts and improving super-resolution results in comparison with the SOTA methods.Furthermore, neural network models such as WDSR can be used in superresolution problems not only for interpolation but also for the specialisation and fine-tuning of the results.As the main requirement for our two-steps approach, the up-sampling algorithm (i.e., the first step) must have a low execution time, to keep the entire pipeline in real-time.
Through learning and high-performance computing, the proposed super-resolution is specialised to different anatomical districts by training multiple networks.Furthermore, we can improve the offline training with new data, a-priori and/or additional information on the input data (e.g., anatomical district, image resolution, acquisition methodology/protocol).The training data set can be periodically updated with the up-sampled images after the expert validation of the superresolution results or with new data to further specialise in the individual networks.The use of deep learning on large data sets overcomes the limitations of vision-based algorithms that are general and do not fully encode the characteristics of the data.Each network is separately trained from scratch on each anatomical district and up-sampling factor.If a small data set is available for a certain anatomical district, we can train a general-purpose network and then specialise dedicated networks with a fine-tuning stage to the specific anatomical districts.
HPC is widespread for the training of learning models in US processing; for example, for the localisation of common carotid artery transverse section through RCNN [19], automatic segmentation of the carotid artery and internal jugular veins [17], fetal standard planes recognition [33], and segmentation and classification of anatomical structures [31].HPC and cloud computing also poses new challenges in terms of reorganisation of the medical analysis pipeline, where the computational demand is shifted to centralised hardware resources with a real-time execution of the network's prediction on local devices [6].

Conclusions
We introduce a novel deep learning framework for the superresolution of US images, which improves the quality of the up-sampling of a selected state-of-the-art algorithm, by training a neural network to match the target high-resolution image.Our method is tested on different anatomical districts (e.g., obstetric, cardiac, obstetric) and up-sampling factors (e.g., 2X, 4X), and it is general with respect to the up-sampling algorithm and the learning model, as long as it complies with the real-time prediction requirement.We analyse the results on 2D images and videos, on both raw and denoised signals, discussing the improvement of the denoising in terms of up-sampling accuracy, at the cost of a small loss of details on the US signal.Our method specialises trained networks to predict the high-resolution target through the design of the network architecture and the loss function, taking into account the anatomical district and the up-sampling factor and exploiting a large ultrasound data set.
In future work, we want to extend the framework to US 3D images and perform with Esaote quality department and expert radiologists a clinical validation of the method through more formalised qualitative survey and evaluation methods [16,34] through an interdisciplinary approach that involves engineering, medical science, physics, and computer science.

Fig. 3
Fig. 3 Prediction on the raw images of the obstetric district: 2X upsampling (first line); 4X up-sampling (second line).The input image (i.e., the input of the neural network) represents the outcome of the upsampling algorithm; the prediction represents the output of the neural Figures 10 and 11 show that the prediction performed with our trained network is robust to different values of image brightness, never lowering the output accuracy or generating artefacts.Comparing the input and the prediction of our network with the target image, we improve the PSNR value from 43.46 to 43.55 with high brightness images from the abdominal district 2X up-sampling, and from 31.01 to 31.48 with low brightness images from the obstetric district 4X up-sampling.

Fig. 7 Fig. 8
Fig. 7 PSNR box-plot (left) of the (a) obstetric, (b) cardiac, and (c) abdominal districts, and error histogram (right): prediction (blue) vs. input (red): 2X (first line) and 4X (second line) results.The box-plot represents the statistic of the PSNR on the 200 images test data set; the

Fig. 10
Fig. 10 Input and prediction of the raw images of the abdominal district 2X with different levels of brightness: low brightness (first row) and high brightness (second row)

Fig. 14
Fig.14 Error image of SOTA up-sampling methods vs. our method on the obstetric (0.25X low-resolution) and abdominal (4X up-sampling) anatomical district: cubic convolution (first row); SISR (second row); EDSR (third row); SRGAN (fourth row); our (fifth row).For each image, we report the maximum error value in the range 0 − 255.All the images of the same up-sampling factor (i.e., 2X and 4X) are represented with the same colour scale

Fig. 19
Fig. 19 Training (left) and validation (right) loss (y−axis) with respect to the number of epochs (x−axis)

Table 1
Concerning Figs. 12 and 13, we report the PSNR metric computed between target and up-sampling methods, as the mean value among the 200 test images.Best results in bold

Table 2
Concerning Figs. 12 and 13, we report the SSIM metric computed between target and up-sampling methods, as the mean value among the 200 test images.Best results in bold

Table 3
Concerning Figs. 12 and 13, we report the MAE metric [•10 −2 ] computed between target and up-sampling methods, as the mean value among the 200 test images.Best results in bold