1 Introduction

Ultrasound has become an indispensable imaging technology in modern clinical medicine. Compared with other commonly used imaging modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography/single-photon emission computed tomography (SPECT/PET), ultrasound imaging is real-time, portable, inexpensive, and free of ionizing radiation [1]. Over the past decades, ultrasound imaging has developed rapidly along two trends. Since the major limitation of ultrasonography is its relatively low imaging quality, considerable effort has been made to improve the resolution and contrast of ultrasound images or to reduce their artifacts [2,3,4]. Thanks to these advances in imaging techniques, ultrasound image quality has improved considerably. More sophisticated imaging algorithms and signal processing pipelines, however, naturally lead to more complex and expensive ultrasound equipment. On the other hand, to make full use of the portability of ultrasound imaging, the second trend is to miniaturize or simplify the imaging equipment for a wide range of applications such as home examination or health care in extreme environments [5,6,7]. Due to the size constraints of the equipment, the imaging quality of portable ultrasound systems is further degraded.

Conventional ultrasound imaging systems usually weigh hundreds of kilograms. Most are equipped with wheels, which allows a degree of mobility. For example, bedside ultrasound has been used in biopsy guidance [8], intraoperative navigation [9], and obstetric monitoring [10]. The large size and heavy weight of conventional ultrasound equipment, however, prevent its use in other out-of-hospital settings. With recent advances in integrated circuits, lightweight pocket-sized ultrasound imaging devices have appeared and have been used in emergency aid at accident scenes, disaster relief, military operations, health care on spacecraft, family doctor scenarios, etc. [11, 12]. Despite these wide application possibilities, poor imaging quality remains a major limitation of portable ultrasound imaging systems. It is therefore of great interest to improve the imaging quality of portable ultrasound equipment.

Three aspects determine ultrasound imaging quality: spatial resolution, contrast, and noise level. Compared with traditional full-size imaging devices, portable equipment typically produces images with lower spatial resolution, lower contrast, and more noise. Poor imaging quality not only prevents doctors from making confident diagnoses but may also mislead them into wrong decisions or procedures during emergency treatment. As a result, poor imaging quality has become the major obstacle to the development and wider application of portable ultrasound equipment. This motivates us to propose an ultrasound image reconstruction method that improves the imaging quality of portable equipment in terms of resolution, contrast, and noise reduction.

In the last three decades, many methods have been proposed to improve ultrasound imaging quality. Beamforming is a commonly used approach to improving the lateral/axial resolution or contrast of the imaging. By creating spatial selectivity for the signals received from or sent to a transducer array, beamforming can produce imaging signals with a narrow main lobe, suppressed side lobes, dynamic focus, and reduced speed-of-sound errors [13]. Representative beamforming algorithms include delay-and-sum (DAS), minimum variance (MV) [14, 15], and eigenspace-based MV (ESMV) [16, 17]. Adaptive beamforming such as MV or ESMV can provide resolution and contrast improvements of around 30% or more over traditional DAS beamforming [18]. Besides beamforming, deep learning methods have recently been introduced to reconstruct images from radio frequency (RF) signals. In [19], Nair et al. designed a fully convolutional neural network to segment anechoic cysts directly from RF signals without beamforming. Luchies et al. [20] reconstructed ultrasound images from RF signals with a deep neural network and observed better reconstruction quality than DAS beamforming. Although advanced beamforming methods and RF-based deep learning methods can successfully improve imaging quality, this group of methods involves complex calculations on RF signals, which are rarely accessible in commercially available ultrasound imaging equipment.

Compared with beamforming methods in the RF signal domain, image reconstruction methods in the image domain are more convenient and versatile. Yang et al. [21] used a variation of the pixel compounding method to reconstruct a high-resolution image from a sequence of ultrasound images acquired with random motion. Taxt and Jirík [22] proposed a noise-robust deconvolution method that deconvolves the first and second harmonic images separately, yielding higher-resolution images with reduced speckle noise. In [23], Chen et al. proposed a compressive deconvolution framework to reconstruct enhanced RF images using the alternating direction method of multipliers. The main assumption in [23] is that the point spread function (PSF) of the ultrasound imaging is spatially invariant and can be estimated from the RF signal. In summary, to reconstruct a high-quality image, image reconstruction–based methods usually require more information than a single low-quality image: several measurements with random motion, fundamental and harmonic images, and even parameter estimation from the RF signal are required in [21,22,23], respectively.

Speckle noise reduction is also an important aspect of image quality improvement, since speckle considerably lowers image contrast and blurs image details. Many speckle reduction techniques have been proposed, such as frequency/spatial compounding [24, 25], spatial filtering [26, 27], and multiscale analysis [28, 29]. Although some of these methods reduce speckle noise effectively and are helpful for image analysis tasks such as segmentation, registration, and object detection, this group of methods always has to strike a balance between noise reduction and detail preservation. Furthermore, speckle noise reduction contributes nothing to resolution, which is the most important index of imaging quality.

The abovementioned methods, such as beamforming, image reconstruction, and noise reduction, tend to address only one or two aspects of image quality. In this paper, we aim to improve image quality in all aspects by using a deep learning method to generate high-quality images. Previous work has shown that deep learning methods can be applied to medical image generation [30,31,32] and usually outperform traditional methods. Specifically for ultrasound image reconstruction, compared with full-size ultrasound imaging equipment, the imaging quality of portable equipment is jointly degraded in resolution, contrast, and noise. Reconstructing a high-resolution ultrasound image from a low-resolution one is similar to an image-to-image translation task [33]. We therefore follow previous works that addressed similar problems using generative adversarial networks (GANs), and propose a GAN-based image reconstruction method to overcome the imaging quality limitations of portable ultrasound devices. For the task at hand, the GAN-based method has the following advantages: (1) A multi-level nonlinear mapping between the low-quality and high-quality images can be learned by the model, which therefore has the potential to improve imaging quality in multiple aspects. (2) The feature extractors are learned automatically from actual ultrasound images rather than designed by hand, and are therefore more representative of and adaptive to the data. (3) The discriminator used in a GAN improves imaging quality visually. (4) Once the model is trained, reconstruction is a one-step feedforward process, which is more direct and efficient than methods involving iterative calculations, and thus better suited to real-time ultrasound image processing. (5) Fast-developing hardware implementations of neural networks allow our method to run on small and portable hardware such as FPGAs, so it can easily be incorporated into current portable ultrasound equipment.

The rest of the paper is organized as follows: Methods describes the proposed method, Experiments describes the experimental data, performance metrics, and implementation, Results and discussion presents and discusses the experimental results, and Conclusion concludes our work.

2 Methods

2.1 Network architecture

In this study, we proposed a GAN model to reconstruct high-resolution images for portable ultrasound imaging devices. The network architectures of the GAN models used in this study are shown in Fig. 1.

Fig. 1
figure 1

Flow chart of the reconstruction algorithm

2.1.1 Generator with sparse skip connections

There are many ways to build the generator of a GAN. One choice is an encoder-decoder model [34]. In an encoder-decoder model, an encoder is defined as a 3 × 3 convolution layer followed by an activation layer and a batch-normalization layer. The stride of the convolution is [2, 2] in order to downsample the image; the notation [x, y] means that, for a two-dimensional convolution, the stride of the convolution kernel is x in the first dimension and y in the second. A decoder has a similar structure to the encoder but uses a 4 × 4 deconvolution instead, in order to avoid checkerboard artifacts [35]. The stride of the deconvolution is [2, 2] in order to upsample the image. An encoder-decoder model usually uses the same number of encoders and decoders: an input image is first downsampled by the encoders and then upsampled back to its original size by the decoders. The encoder-decoder model used in this paper is shown in Fig. 2(a). It has three encoders and three decoders. The input of the model is a 128 × 128 single-channel patch, and the activation function is leaky ReLU.
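To make the block structure concrete, the following sketch shows one possible tf.keras implementation of the encoder and decoder blocks described above. The filter counts passed to each block and the leaky ReLU slope are our assumptions, since the text does not specify them.

```python
# A minimal sketch of the encoder and decoder blocks described above.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(filters):
    # 3 x 3 convolution with stride [2, 2] downsamples the feature map,
    # followed by a leaky ReLU activation and batch normalization.
    return tf.keras.Sequential([
        layers.Conv2D(filters, kernel_size=3, strides=2, padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(),
    ])

def decoder_block(filters):
    # 4 x 4 transposed convolution with stride [2, 2] upsamples the feature
    # map; a kernel size divisible by the stride avoids checkerboard
    # artifacts [35].
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, kernel_size=4, strides=2, padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.BatchNormalization(),
    ])
```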

Fig. 2
figure 2

An encoder-decoder model (a), a U-Net model (b), and our SSC U-Net model (c)

However, as demonstrated in [33], a bottleneck exists in the encoder-decoder structure, which limits the sharing of low-level information between input and output. To address this, the authors of [33] proposed a U-Net model that adds skip connections between mirrored encoders and decoders, allowing more low-level information to pass from the input to the output. The U-Net model used in this paper is shown in Fig. 2(b). The U-Net model has been successfully applied to super-resolution reconstruction of many medical images, such as MRI [30, 32], plane wave ultrasound images [36], and CT [37]. However, because low-quality ultrasound images contain many speckles and artifacts, applying a U-Net model to ultrasound image super-resolution raises a new issue: sharing all low-level information between the input and the output carries speckles and artifacts from the low-resolution images into the reconstructed high-resolution images. This is because the top skip connections form a shallow feedforward path that extracts few features from the input images.

In order to maintain the structure information of the low-resolution image while not carrying its speckles and imaging artifacts into the high-resolution one, we designed a new generator that concatenates only the output of the third encoder to the input of the third decoder. This design keeps the benefits of the U-Net while reducing the amount of low-level information passed directly to the output. We call our model a sparse skip connection U-Net (SSC U-Net); its architecture is shown in Fig. 2(c).
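The sketch below is one plausible reading of the SSC U-Net generator in tf.keras, reusing the encoder_block and decoder_block helpers sketched above. The bottleneck convolution, the filter counts, and the tanh output (matching the [−1, 1] intensity range used in Section 3.3) are assumptions on our part; Fig. 2(c) is authoritative for the exact wiring.

```python
# A plausible sketch of the SSC U-Net generator (Fig. 2c): three encoders,
# a decoder stack, and a single skip connection from the deepest encoder.
def build_ssc_unet(input_shape=(128, 128, 1)):
    inp = tf.keras.Input(shape=input_shape)
    e1 = encoder_block(64)(inp)      # 64 x 64
    e2 = encoder_block(128)(e1)      # 32 x 32
    e3 = encoder_block(256)(e2)      # 16 x 16
    # Bottleneck convolution between the encoder and decoder stacks (assumed).
    b = layers.Conv2D(256, 3, padding="same", activation="relu")(e3)
    # The single (sparse) skip connection: only the deepest encoder output is
    # concatenated back in; the shallow skips of the U-Net are dropped.
    x = layers.Concatenate()([b, e3])
    d1 = decoder_block(128)(x)       # 32 x 32
    d2 = decoder_block(64)(d1)       # 64 x 64
    out = layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                 activation="tanh")(d2)  # 128 x 128, in [-1, 1]
    return tf.keras.Model(inp, out, name="ssc_unet")
```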

2.1.2 Discriminator and training strategy

The discriminator operates on local patches of the original image during training. This strategy is based on the assumption that pixels from different patches are independent, and it encourages the discriminator to model high-frequency details [38]. Local patching also enlarges the dataset and saves memory during training. This idea is widely used in tasks such as image style transfer [39]. The patch size in our network is 128 × 128.
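A patch discriminator in this spirit might look as follows. Since the objective in Section 2.2 conditions the discriminator on the input image, this sketch concatenates the input patch with the real or generated patch; the layer widths are our assumptions.

```python
# Sketch of a patch discriminator operating on 128 x 128 patches, in the
# spirit of [38]; layer widths are assumed, not taken from the paper.
def build_discriminator(patch_shape=(128, 128, 1)):
    src = tf.keras.Input(shape=patch_shape)   # low-quality input patch x
    tgt = tf.keras.Input(shape=patch_shape)   # real y or generated G(x) patch
    h = layers.Concatenate()([src, tgt])      # conditional pair, as in D(x, y)
    for filters in (64, 128, 256):
        h = layers.Conv2D(filters, 3, strides=2, padding="same")(h)
        h = layers.LeakyReLU(alpha=0.2)(h)
    logits = layers.Conv2D(1, 3, padding="same")(h)  # per-region real/fake logits
    return tf.keras.Model([src, tgt], logits, name="patch_discriminator")
```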

Our training strategy follows the approach in [40]: the generator is updated first, and the discriminator is then trained with the real images and the images produced by the generator. We use the Adam optimizer.
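As a minimal illustration, one training iteration under this strategy might look as follows. Here gen and disc refer to the generator and discriminator sketched above, the learning rate and β1 follow Section 3.3, and generator_loss, which implements the full objective of Eq. (4), is sketched in the next subsection.

```python
# A minimal sketch of one alternating training iteration (assumed layout).
gen = build_ssc_unet()
disc = build_discriminator()
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(learning_rate=5e-5, beta_1=0.9)
d_opt = tf.keras.optimizers.Adam(learning_rate=5e-5, beta_1=0.9)

@tf.function
def train_step(x, y):                         # x: low-quality, y: high-quality
    with tf.GradientTape() as gt:             # 1) update the generator first
        fake = gen(x, training=True)
        g_loss = generator_loss(disc([x, fake], training=True), fake, y)
    g_opt.apply_gradients(zip(gt.gradient(g_loss, gen.trainable_variables),
                              gen.trainable_variables))
    with tf.GradientTape() as dt:             # 2) then update the discriminator
        real_logits = disc([x, y], training=True)
        fake_logits = disc([x, gen(x, training=True)], training=True)
        d_loss = (bce(tf.ones_like(real_logits), real_logits) +
                  bce(tf.zeros_like(fake_logits), fake_logits))
    d_opt.apply_gradients(zip(dt.gradient(d_loss, disc.trainable_variables),
                              disc.trainable_variables))
    return g_loss, d_loss
```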

2.2 Objective

A cross entropy loss is usually used in GAN training. We call this loss adversarial loss:

$$ {L}_{GAN}\left(G,D\right)={E}_{x,y\sim {P}_{data}\left(x,y\right)}\left[\log D\left(x,y\right)\right]+{E}_{x\sim {P}_{data}(x)}\left[\log \left(1-D\left(x,G(x)\right)\right)\right] $$
(1)

where x is the input vector, y is the output vector, D is the discriminator, G is the generator, L is the loss function, and E denotes expectation. The adversarial loss encourages the generator to produce images as close to the real images as possible.

In our task, we need to reconstruct a high-resolution image from the input low-resolution image. This is a supervised learning task, so we add an L1 loss to our model to maintain pixel-wise similarity [41]. Previous work [42] confirms that traditional losses can be mixed with the adversarial loss in GAN training; the L1 loss helps stabilize training and preserves low-frequency image information. The L1 loss is defined as follows:

$$ {L}_{L1}(G)={E}_{x,y\sim {P}_{data}\left(x,y\right)}\left[{\left\Vert y-G(x)\right\Vert}_1\right] $$
(2)

However, the L1 loss blurs the output [41]; texture and speckle, for example, are likely to be smoothed away. While the L1 loss is essential for keeping low-frequency information, we introduced a differential loss to preserve the sharpness of edges in the generated images. The differential loss sums, over the vertical and horizontal directions, the absolute differences between the gradients of the real image and those of the generated image:

$$ {L}_{diff}(G)={E}_{x,y\sim {P}_{data}\left(x,y\right)}\left[\sum \limits_{i=1,2}|\frac{\partial y}{\partial {x}_i}-\frac{\partial }{\partial {x}_i}G(x)|\right] $$
(3)

Our final objective is:

$$ {G}^{\ast }=\arg \underset{G}{\min}\underset{D}{\max }{L}_{GAN}\left(G,D\right)+\alpha {L}_{L1}(G)+\beta {L}_{diff}(G) $$
(4)

In our experiments, we choose α = 100 and β = 80 in Eq. (4).
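Under the same assumptions as the training sketch above, the composite generator objective of Eq. (4) can be written as follows. tf.image.image_gradients returns the vertical and horizontal gradients of a batch of images, which matches the differential loss of Eq. (3), and bce is the cross-entropy defined in the training sketch.

```python
# Sketch of the composite generator objective in Eq. (4), with
# alpha = 100 and beta = 80 as chosen above.
ALPHA, BETA = 100.0, 80.0

def generator_loss(fake_logits, fake, real):
    adv = bce(tf.ones_like(fake_logits), fake_logits)   # generator side of Eq. (1)
    l1 = tf.reduce_mean(tf.abs(real - fake))            # Eq. (2)
    dy_r, dx_r = tf.image.image_gradients(real)         # gradients of y
    dy_f, dx_f = tf.image.image_gradients(fake)         # gradients of G(x)
    diff = tf.reduce_mean(tf.abs(dy_r - dy_f) + tf.abs(dx_r - dx_f))  # Eq. (3)
    return adv + ALPHA * l1 + BETA * diff               # Eq. (4)
```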

3 Experiments

3.1 Training datasets

Three datasets, comprising 50 pairs of simulation images, 40 pairs of phantom images, and 72 pairs of in vivo images, were used to train and test the GAN model. The high-quality and low-quality phantom and in vivo images are acquired from different devices and hence need to be registered. We align the images using the non-rigid image registration algorithm introduced in [43], with mutual information as the similarity metric [30].

The 50 pairs of simulation data are generated with the Field II ultrasound simulation program [44, 45]. All simulated data are generated with plane wave transmission, using two simulation models: cysts and fetus. The cyst phantom images are simulated with the following settings: number of transducer elements = 64, number of scanning lines = 50, number of scatterers = 100,000, and central frequency = 3.5 MHz, 5 MHz, or 8 MHz, with 20 images simulated for each central frequency. The images at 3.5 and 5 MHz are used as low-quality images, while the 8-MHz images are averaged to obtain high-quality images. For the fetus phantom, we simulate 10 low-quality images with central frequency = 5 MHz, number of transducer elements = 64, number of scanning lines = 128, and number of scatterers = 200,000, and 10 high-quality images with central frequency = 8 MHz, number of transducer elements = 128, number of scanning lines = 128, and number of scatterers = 500,000. According to [44, 45], these settings are able to simulate fully developed speckle. In all, we obtain 50 pairs of simulated images: 40 cyst pairs and 10 fetus pairs.

The 40 pairs of phantom data are acquired with a Vantage 64™ research ultrasound system (Verasonics, Inc., USA) and an mSonics MU1 (Youtu Technology, China). The phantoms include two CIRS phantoms (Computerized Imaging Reference Systems, Inc., USA), an ultrasound resolution phantom (model 044) and a multi-purpose multi-tissue ultrasound phantom (model 040GSE), as well as two self-made pork phantoms. The programmable Verasonics Vantage 64 system is used to acquire the high-quality images and the mSonics MU1 handheld scanner the low-quality ones. The Verasonics system settings are: central frequency = 7 MHz and dynamic range = 50 dB, with multi-angle plane wave compounding (20 angles from −16° to 16°) and a 40-mm wide L11-4v transducer with 128 elements. The mSonics MU1 settings are: central frequency = 7 MHz and gain = 70 dB, with a 40-mm wide L10-5 transducer with 128 elements. In all, we acquire 40 pairs of phantom images: 25 for the CIRS phantoms and 15 for the pork phantoms.

The 72 pairs of in vivo data are acquired with a Toshiba Aplio 500 (Toshiba Medical Systems Corporation, Japan) and an mSonics MU1 (Youtu Technology, China). Thyroid images from 50 subjects and carotid images from 22 subjects are scanned. The clinical Toshiba Aplio 500, with central frequency = 7.5 MHz and gain = 76 dB, is used to acquire the high-quality images. The portable mSonics MU1 is set as follows: central frequency = 6 MHz and gain = 95 dB, with a 40-mm wide L10-5 transducer with 128 elements. The focal depth during acquisition is around 1 to 2 cm for the in vivo data.

Examples of the experimental data are shown in Fig. 3. The left and right columns of Fig. 3 show the low-quality and high-quality images respectively; from top to bottom are examples of simulation, phantom, and in vivo data, with two examples per category.

Fig. 3
figure 3

Examples of experimental data

3.2 Performance metrics

To evaluate the image reconstruction performance, we calculate peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and mutual information (MI) for all three datasets. Full width at half maximum (FWHM) of point targets and contrast resolution (CR) are calculated for simulated point object images and cyst images respectively.

PSNR measures the similarity of two images. If one is the ground-truth image, it measures the quality of the other one:

$$ \mathrm{PSNR}=10\times {\log}_{10}\left(\frac{{\left({2}^n-1\right)}^2}{\mathrm{MSE}}\right) $$
(5)

where MSE is the mean squared error between the reconstructed and ground-truth high-quality images, and n = 8 for uint8 images. A higher PSNR indicates higher intensity similarity between the reconstructed image and the high-quality image.
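For reference, Eq. (5) for uint8 images can be transcribed directly as a short NumPy sketch:

```python
# PSNR for uint8 images (n = 8), a direct transcription of Eq. (5).
import numpy as np

def psnr_uint8(recon, truth):
    mse = np.mean((recon.astype(np.float64) - truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(((2 ** 8 - 1) ** 2) / mse)
```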

SSIM measures the similarity of two images. It is defined as follows:

$$ \mathrm{SSIM}=\frac{\left(2{\mu}_a{\mu}_b+{c}_1\right)\left(2{\sigma}_{ab}+{c}_2\right)}{\left({\mu_a}^2+{\mu_b}^2+{c}_1\right)\left({\sigma_a}^2+{\sigma_b}^2+{c}_2\right)} $$
(6)

where a and b are windows on the reconstructed and ground-truth high-quality images respectively, μa and μb are the mean values of a and b, σa² and σb² are the variances of a and b, and σab is the covariance between a and b. The constants c1 and c2 prevent the denominator from being zero: c1 = (k1L)² and c2 = (k2L)², where k1 = 0.01, k2 = 0.03, and L is the dynamic range of the pixel values. The windows are weighted by an isotropic Gaussian function with a standard deviation of 1.5. A higher SSIM represents higher structural similarity between the reconstructed image and the high-quality image.
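In practice, SSIM with these settings can be computed with scikit-image; the following call is our assumption of an equivalent configuration, with recon and truth standing for the reconstructed and ground-truth uint8 images.

```python
# SSIM per Eq. (6) with a Gaussian window of sigma = 1.5 (k1, k2 are the
# scikit-image defaults 0.01 and 0.03, matching the text above).
from skimage.metrics import structural_similarity

score = structural_similarity(
    recon, truth,              # assumed variable names for the two images
    data_range=255,            # L in c1 = (k1*L)^2 and c2 = (k2*L)^2
    gaussian_weights=True,     # isotropic Gaussian window
    sigma=1.5,                 # standard deviation of 1.5
    use_sample_covariance=False,
)
```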

MI indicates the mutual dependence between two images. It is defined as follows for a pair of uint8 images:

$$ \mathrm{MI}\left(I,J\right)=\sum_{i=0}^{255}\sum_{j=0}^{255}{P}_{IJ}\left(i,j\right)\log \frac{P_{IJ}\left(i,j\right)}{P_I(i){P}_J(j)} $$
(7)

where PI(i) and PJ(j) represent the marginal probability distributions of I and J, and PIJ(i, j) is their joint probability distribution.
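Eq. (7) can be implemented from the joint histogram of the two uint8 images; the following NumPy sketch skips empty histogram bins to avoid log(0).

```python
# Mutual information from the joint intensity histogram, per Eq. (7).
import numpy as np

def mutual_information(img_i, img_j):
    joint, _, _ = np.histogram2d(img_i.ravel(), img_j.ravel(),
                                 bins=256, range=[[0, 256], [0, 256]])
    p_ij = joint / joint.sum()                 # joint distribution P_IJ(i, j)
    p_i = p_ij.sum(axis=1, keepdims=True)      # marginal P_I(i)
    p_j = p_ij.sum(axis=0, keepdims=True)      # marginal P_J(j)
    nz = p_ij > 0                              # skip empty bins: avoid log(0)
    return np.sum(p_ij[nz] * np.log(p_ij[nz] / (p_i @ p_j)[nz]))
```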

CR measures the ability to distinguish intensity differences in an image:

$$ \mathrm{CR}=\frac{\left|{S}_A-{S}_B\right|}{S_A+{S}_B} $$
(8)

where SA and SB represent the average intensity values in two regions of interest (ROIs) of the same size. A higher CR represents higher imaging contrast.
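Given the two ROIs, Eq. (8) reduces to a few lines:

```python
# Contrast resolution from two same-sized ROIs, per Eq. (8).
import numpy as np

def contrast_resolution(roi_a, roi_b):
    s_a, s_b = float(np.mean(roi_a)), float(np.mean(roi_b))
    return abs(s_a - s_b) / (s_a + s_b)
```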

Since FWHM and CR are only applicable to the simulation data, PSNR, SSIM, and MI are used to evaluate the performance of the algorithms on the phantom and in vivo data. The tables in this part report the results of 5-fold cross-validation experiments: each dataset is randomly divided into five groups, and in each fold one group is used as test data while the other four are used as training data.
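The random grouping described above can be reproduced, for example, with scikit-learn's KFold; image_pairs is a hypothetical list of matched low-/high-quality image pairs.

```python
# Sketch of the 5-fold split; image_pairs is a hypothetical list of
# matched (low-quality, high-quality) image pairs.
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(image_pairs)):
    train_pairs = [image_pairs[i] for i in train_idx]  # four groups for training
    test_pairs = [image_pairs[i] for i in test_idx]    # one held-out group
    # train the GAN on train_pairs and evaluate PSNR/SSIM/MI on test_pairs
```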

3.3 Implementation

We implement our reconstruction model in TensorFlow and train it on a Titan Xp graphics processing unit (GPU). We extract a 128 × 128 patch every ten pixels from each original image, giving 68,450 pairs of simulation patches, 38,480 pairs of phantom patches, and 73,736 pairs of in vivo patches in total. The input images are uint8 images whose pixel intensities are rescaled to [−1, 1], and the mean value of the training dataset is subtracted from the input data. The mean value of the training dataset is added back to the output images, whose intensity values are then clipped to [−1, 1] and transformed back into uint8 grayscale images. The learning rate is set to 0.00005 and follows an exponential decay to 0.000005. β1 of the Adam optimizer is set to 0.9.
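The following sketch summarizes this pre- and post-processing pipeline; all function and variable names are ours, and train_mean stands for the mean value of the training dataset (in the rescaled [−1, 1] range).

```python
# Sketch of the pre/post-processing described above: dense 128 x 128 patch
# extraction with a ten-pixel stride, rescaling to [-1, 1], and mean handling.
import numpy as np

def extract_patches(image, size=128, stride=10):
    # Slide a size x size window over the image, one patch every `stride` pixels.
    patches = []
    for r in range(0, image.shape[0] - size + 1, stride):
        for c in range(0, image.shape[1] - size + 1, stride):
            patches.append(image[r:r + size, c:c + size])
    return np.stack(patches)

def preprocess(patches_uint8, train_mean):
    x = patches_uint8.astype(np.float32) / 127.5 - 1.0   # rescale uint8 to [-1, 1]
    return x - train_mean                                # subtract training-set mean

def postprocess(model_output, train_mean):
    y = np.clip(model_output + train_mean, -1.0, 1.0)    # add mean back, then clip
    return ((y + 1.0) * 127.5).astype(np.uint8)          # map [-1, 1] back to uint8
```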

4 Results and discussion

In this section, we report the results of the three models we tested, the encoder-decoder model, the U-Net model, and our SSC U-Net, on the simulation, phantom, and in vivo data.

4.1 Simulation data results

The simulation results are shown in Table 1. The SSC U-Net outperformed the encoder-decoder model and the U-Net model with higher PSNR, MI, and CR and a smaller FWHM. Although the U-Net model had the highest SSIM, its generated images were over-smoothed, meaning that some high-frequency information was lost; over-smoothing averages out the errors between the generated and ground-truth images and thus inflates the SSIM score. Figure 4 shows examples of simulated cyst images: the point target in the red box reconstructed by the SSC U-Net has better resolution than those of the other methods. Figure 5 shows examples of simulated fetus images, in which details are clearer in the result generated by the SSC U-Net.

Table 1 Results of the three models on simulation dataset
Fig. 4
figure 4

Results on the simulation dataset. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

Fig. 5
figure 5

Result examples of simulated fetus images. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

4.2 Phantom data results

Table 2 shows the average performance of the three models on the phantom dataset. Our model achieved the highest PSNR and MI. The SSIM of the U-Net model is 1.6% higher than that of our method; however, the U-Net model retained its drawback of over-smoothing, and as mentioned in Section 4.1, over-smoothing can inflate the SSIM. Figure 6 shows an example of the CIRS phantom results: the SSC U-Net reconstructed the point targets well, while the other methods blurred them. Figure 7 shows an example of the pork phantom results, where the point target in the red box reconstructed by the SSC U-Net has the highest resolution.

Table 2 Results of the three models on phantom dataset
Fig. 6
figure 6

Result examples of cyst phantom images. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

Fig. 7
figure 7

Result examples of pork phantom images. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

4.3 In vivo data results

Table 3 presents the results on the in vivo dataset. Our SSC U-Net achieved the best PSNR, SSIM, and MI. Figures 8 and 9 give two examples of the generated images. The encoder-decoder model failed to generate satisfactory results on this dataset, and the images generated by the U-Net smoothed away some important details: for example, the calcification point in Fig. 8 and the cyst in Fig. 9 were blurred by the U-Net. In comparison, the result of our SSC U-Net was visually better and preserved more details.

Table 3 Results of the three models on in vivo dataset
Fig. 8
figure 8

Results of in vivo thyroid data. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

Fig. 9
figure 9

Results of in vivo carotid data. The first line shows, from left to right, the low-quality image, the encoder-decoder result, the U-Net result, the SSC U-Net result, and the high-quality image. The second line shows the region inside the red box of the corresponding image in the first line

4.4 Differential loss

The differential loss was introduced into our network to maintain the sharpness of edges, and its impact was investigated in an ablation experiment. We chose a horizontal line crossing the center of a point target and plotted the intensity of the pixels on that line in Fig. 10. The plots show that, with the help of the differential loss, the reconstructed images achieve a higher resolution.
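The line-profile analysis of Fig. 10 can be reproduced with a sketch like the following, which extracts a horizontal intensity profile through a point target and estimates its FWHM. The row index and the variable image are hypothetical, and the simple threshold assumes a single well-separated peak on the line.

```python
# Sketch of the line-profile analysis in Fig. 10: intensity along a
# horizontal line through a point target, plus a simple FWHM estimate.
import numpy as np

def fwhm_pixels(profile):
    # Width at half of the peak height above the background minimum;
    # assumes a single well-separated peak on the line.
    profile = profile.astype(np.float64)
    half = profile.min() + (profile.max() - profile.min()) / 2.0
    above = np.where(profile >= half)[0]
    return above[-1] - above[0] + 1

row = 64                        # row through the point-target center (hypothetical)
profile = image[row, :]         # horizontal intensity profile, values 0-255
print("FWHM (pixels):", fwhm_pixels(profile))
```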

Fig. 10
figure 10

The left column from top to bottom are a low-quality image, an image generated without differential loss, an image generated with differential loss, and a high-quality image. The right column shows the intensity on the red line of the corresponding image. Intensity values range from 0 to 255

5 Conclusion

Image quality is vital for portable ultrasound imaging devices. In this paper, we proposed a new generative model, called SSC U-Net, to improve the image quality of portable ultrasound imaging devices. We tested our model on three datasets, simulation, phantom, and in vivo data, and compared it with two other widely used GAN generator architectures: the encoder-decoder model and the U-Net model. Our experimental results show that the SSC U-Net generally outperformed the other two models on all three datasets: images generated by the SSC U-Net had better resolution and preserved more details than those of the other two methods.