SpeckleGAN: a generative adversarial network with an adaptive speckle layer to augment limited training data for ultrasound image processing

Purpose In the field of medical image analysis, deep learning methods have gained considerable attention in recent years. This can be explained by their often improved performance compared to classic explicit algorithms. To work well, they need large amounts of annotated data for supervised learning, but these are often not available for medical image data. One way to overcome this limitation is to generate synthetic training data, e.g., by performing simulations to artificially augment the dataset. However, simulations require domain knowledge and are limited by the complexity of the underlying physical model. Another way to perform data augmentation is to generate images by means of neural networks.

Methods We developed a new algorithm for generating synthetic medical images exhibiting speckle noise via generative adversarial networks (GANs). The key ingredient is a speckle layer, which can be incorporated into a neural network in order to add realistic and domain-dependent speckle. We call the resulting GAN architecture SpeckleGAN.

Results We compared our new approach to an equivalent GAN without a speckle layer. SpeckleGAN was able to generate ultrasound images with very crisp speckle patterns in contrast to the baseline GAN, even for small datasets of 50 images. SpeckleGAN outperformed the baseline GAN by up to 165 % with respect to the Fréchet Inception distance. For artery layer and lumen segmentation, a performance improvement of up to 4 % was obtained for small datasets when these were augmented with images generated by SpeckleGAN.

Conclusion SpeckleGAN facilitates the generation of realistic synthetic ultrasound images to augment small training sets for deep learning based image processing. Its application is not restricted to ultrasound images; it could be used for any imaging modality that produces images with speckle, such as optical coherence tomography or radar.


Introduction
Cardiovascular diseases like atherosclerosis are the leading cause of death globally [12]. A common methodology for assessing the severity and progress of plaque building in coronary arteries is intravascular ultrasound (IVUS) as it provides information regarding the vessel wall and the composition of plaques.
In recent years, diagnosis has been increasingly supported by algorithms which provide additional information to the physician. In particular, powerful deep learning methods have gained significant importance due to their superior performance compared to many explicit algorithms. Typical applications are detection and classification of diseases or segmentation of different tissues.
A drawback is the need for large annotated training datasets in order to get useful results. Annotations are usually made by trained experts to ensure high quality. This naturally leads to a lack of high-quality data. To overcome these limitations, data augmentation methods are commonly used [18]. In addition to applying random transformations to the data samples (which do not alter their labels), the generation of artificial training data is a possible way to enlarge the training set. One way to generate synthetic data is to run simulations. These often rely on rather simple models or require in-depth domain knowledge, leading to results which are either of low quality or quite time-consuming to obtain. Another promising method to generate artificial data is by training generative adversarial networks (GANs) [6]. Nevertheless, GANs also need sufficient amounts of training data to reach satisfactory performance. Often they are trained with more than 10,000 images or even 100,000 images when dealing with rather diverse datasets [15]. To reduce the amount of needed data, theory-guided operations or modules may be integrated into the neural network architecture [11]. These arise from theoretical considerations or physical models which can replace parts of the network. In this way, the amount of model capacity which would be used to learn these physical concepts is free to learn other features. In addition, theory-based network modules serve to regularize the training process and can thus lead to improved performance.

* Lennart Bargsten, lennart.bargsten@tuhh.de. Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology, Hamburg, Germany
We designed such a theory-guided network module to add speckle noise to network feature maps and integrated it into a GAN architecture, which we call SpeckleGAN. This enables us to generate realistic IVUS images with very few training examples, while keeping the overall network architecture simple. Furthermore, the size of the resulting speckles can vary within a single image and is learned during the training process. Finally, we show how IVUS image segmentation performance can be improved by pre-training a neural network with synthetic images from SpeckleGAN if only very limited data are available. Our method thus enables the training of high-capacity neural networks with few data while simultaneously preventing overfitting.

Speckle layer
Speckle is an interference phenomenon in imaging systems and occurs if the mean distance between scatterers is smaller than the resolution cell defined by the imaging methodology [2]. The size of the resolution cell is determined mainly by the wavelength of the carrier (or excitation) signal. Another condition for the development of speckle is the presence of independent random phases of the scattered waves at the point of observation, usually generated by surface roughness (optics) or inhomogeneous volumes like tissue (ultrasound). Interference of these signals leads to characteristic speckle patterns.
The algorithm for the speckle layer resembles the one found in the appendix of [8] and is based on the principles of Fourier optics explained in [7]. In Fourier optics, one takes advantage of the fact that under certain simplifications the propagation and diffraction of wave signals can be expressed as Fourier transformations. Although the process of speckle formation differs in ultrasound systems, the resulting effect on the gray values is similar and we illustrate the approach in the context of a simple optical system.
The algorithm is based on an imaging system comprising an illuminated rough object and a converging lens (see Fig. 1b). The propagation and focusing of the wave signal emitted by the object can be represented by two consecutive Fourier transformations. This is possible if some approximations are applied to the following general form of the diffraction integral, which describes how wave signals are diffracted at apertures:

$$U(x_0, y_0) = \frac{1}{j\lambda} \iint_{\Sigma} \frac{\exp(jkr_{01})}{r_{01}} \cos(\vec{n}, \vec{r}_{01})\, U(x_1, y_1)\, \mathrm{d}x_1\, \mathrm{d}y_1. \tag{1}$$

Here, $U(x_0, y_0)$ denotes the field amplitude in the plane of observation, $U(x_1, y_1)$ the field amplitude in the aperture plane and $\Sigma$ the aperture. The vector $\vec{n}$ represents the normal of the aperture plane, $k$ is the wave number, $\lambda$ the wavelength, $\vec{r}_{01}$ the vector between a point on the aperture plane and another point on the plane of observation and $r_{01}$ its norm. See Fig. 1a for a corresponding sketch. Further details regarding the derivation of the formula and its application to the imaging system of Fig. 1b can be found in [7].

Fig. 1 a: Sketch showing diffraction at an aperture. Variable naming corresponds to Eq. 1. b: Sketch of a simple imaging system with a rough object and a converging lens. Due to the roughness, the object's signal exhibits a spatial distribution of random phases which leads to speckle patterns in the focal plane of the lens

The speckle layer imitates the optical system of Fig. 1b and can be described by the following equation:

[...] (2)

where $I(x, y)$ and $I_{sp}(x, y)$ denote the source and speckled image, respectively. $\mathcal{F}$ represents the Fourier transformation and $\mathrm{rect}_d(x, y)$ the rectangular window function with edge length $d$. For the sake of simplicity, we did not use a circular window function as suggested by the lens in Fig. 1. On the one hand, we did not observe any difference in the visual appearance of the resulting speckle; on the other hand, the calculation of a circular mask function is computationally more expensive, because the distance from every pixel to the image center has to be calculated in every training step.

Equation 2 can be interpreted as a low-pass filter of the source image which is multiplied pixel-wise with random phases and is thus equivalent to

[...] (3)

Here, $*$ is the convolution operator and $\mathrm{sinc}_d(x, y)$ the sinc function with scale $d$. The edge length $d$ of the rectangular window function defines the mean size of the resulting speckles and can be learned during training of the neural network. Smaller windows lead to larger speckle patches. We note that the runtime complexity of a convolution operation scales with $n^2$, while the fast Fourier transform (FFT) scales with $n \log(n)$. It is thus computationally more efficient to implement Eq. 2. In order to generate the typical speckle patterns for centric IVUS views, coordinate transforms from polar to Cartesian coordinates and vice versa were added to the pipeline. An exemplary speckle transformation process is depicted in Fig. 2.
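As a rough illustration of the FFT-based speckle transformation described above, the following NumPy sketch assumes that Eq. 2 has the form $I_{sp} = |\mathcal{F}^{-1}[\mathcal{F}[I \cdot \exp(j\varphi)] \cdot \mathrm{rect}_d]|^2$ with independent phases $\varphi$ drawn uniformly from $[0, 2\pi)$. The squared modulus, the phase distribution, the function names `rect_window` and `add_speckle`, and the fixed integer window size are assumptions made for illustration. The polar/Cartesian transforms and the learnable window size are omitted.

```python
import numpy as np

def rect_window(shape, d):
    """Centered d x d square window of ones (hypothetical helper)."""
    h, w = shape
    mask = np.zeros((h, w))
    top, left = (h - d) // 2, (w - d) // 2
    mask[top:top + d, left:left + d] = 1.0
    return mask

def add_speckle(image, d, rng):
    """Assumed form: I_sp = |F^-1[ F[I * exp(j*phi)] * rect_d ]|^2."""
    # Independent random phase per pixel (assumed uniform on [0, 2*pi)).
    phi = rng.uniform(0.0, 2.0 * np.pi, size=image.shape)
    field = image * np.exp(1j * phi)
    # Shift the zero frequency to the array center so the window is centered on it.
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    filtered = spectrum * rect_window(image.shape, d)  # low-pass filter
    out = np.fft.ifft2(np.fft.ifftshift(filtered))
    return np.abs(out) ** 2

rng = np.random.default_rng(0)
image = np.ones((64, 64))                  # flat toy source image
coarse = add_speckle(image, d=8, rng=rng)  # small window, large speckles
fine = add_speckle(image, d=32, rng=rng)   # large window, small speckles
```

In this sketch, the smaller window removes more high frequencies, so `coarse` varies more slowly from pixel to pixel than `fine`, which matches the statement that smaller windows lead to larger speckle patches.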

SpeckleGAN architecture
To generate IVUS images with defined geometry regarding the artery lumen and the intima/media layers, a segmentation mask has to be used as a conditional input. A promising way to process the segmentation masks is spatially-adaptive normalization (SPADE) for semantic image synthesis [15]. SPADE layers transform segmentation masks (here, encoded as images with integer pixel values from {0, 1, 2}, where each value corresponds to a tissue class) into feature maps $\gamma$ and $\beta$ by feeding them through two convolutional layers, respectively. The segmentation masks are resized before being fed into SPADE so that they have the same size as the feature maps to be normalized.
Pixel values $x^{in}_{n,c,h,w}$ of input feature maps to be normalized are transformed as follows:

$$x^{out}_{n,c,h,w} = \gamma_{n,c,h,w}\, \frac{x^{in}_{n,c,h,w} - \mu_c}{\sigma_c} + \beta_{n,c,h,w},$$

where $\gamma$ and $\beta$ are the SPADE feature maps computed from the segmentation mask and the multi-index $(n, c, h, w)$ refers to (sample in batch, channel, height, width). The parameters $\mu_c$ and $\sigma_c$ denote the channel-wise mean and standard deviation of $x^{in}_{:,c,:,:}$, respectively. A colon indexes the whole tensor dimension.

Figure 3 gives an overview of the overall GAN architecture. Generator and discriminator consist of multiple residual blocks [9]. In the generator, SPADE [15] layers are used to condition the generated image on a given segmentation mask. The first convolutions in all SPADE layers have 64 output channels. Batch normalization precedes the affine transformation by SPADE and is also used in the discriminator. Upscaling in the generator is performed by nearest neighbor interpolation, while downscaling in the discriminator is performed by convolutions with a stride of 2. The generator is seeded with a 128-dimensional random vector sampled from a standard multivariate Gaussian distribution. Spectral normalization [14] was applied to the generator and the discriminator.

The speckle layer follows the penultimate residual layer of the generator. Here, the feature maps have already reached the output image size. Inserting the speckle layer into a deeper part of the network led to poor results. One reason could be that the feature maps in deeper layers have not yet reached the original image size. The speckle layer adds speckle noise with 4 different speckle sizes to each input feature map. This means that 8 input feature maps are transformed to 32 output feature maps, whereby 4 feature maps each exhibit the same morphology but with different speckle sizes. These hyperparameters were found by grid search and stayed the same for all experiments. The input feature maps of the speckle layer are also used to compute channel attention coefficients by applying global sum pooling and two linear layers.
The output feature maps of the speckle layer are weighted with these coefficients to filter out unimportant combinations of input feature maps and speckle sizes. A spatial attention approach led to massive checkerboard artifacts and was therefore discarded. The resulting synthetic IVUS images have a size of 256 × 256 pixels.
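The channel attention step can be sketched as follows. This is only an illustration of the shape bookkeeping (8 input maps, 4 speckle sizes, 32 weighted output maps). The hidden width of 16, the ReLU and sigmoid activations, the random weights and the function name `channel_attention` are assumptions, since the text specifies only global sum pooling and two linear layers. The speckle transformation itself is replaced by a simple repetition of the input maps.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Attention coefficients from input maps (hypothetical helper)."""
    pooled = features.sum(axis=(1, 2))           # global sum pooling -> (C_in,)
    hidden = np.maximum(w1 @ pooled, 0.0)        # first linear layer + assumed ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # second linear layer + assumed sigmoid

rng = np.random.default_rng(0)
n_in, n_sizes, size = 8, 4, 16
features = rng.normal(size=(n_in, size, size))

# Placeholder for the speckle layer output: each input map appears once
# per speckle size, giving 8 * 4 = 32 output maps.
speckled = np.repeat(features, n_sizes, axis=0)

w1 = rng.normal(scale=0.1, size=(16, n_in))            # assumed hidden width 16
w2 = rng.normal(scale=0.1, size=(n_in * n_sizes, 16))
coeff = channel_attention(features, w1, w2)            # one coefficient per output map
weighted = speckled * coeff[:, None, None]             # weight each output map
```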

Dataset
The underlying IVUS dataset was provided by Balocco et al. [1] and consists of 435 IVUS images captured with a 20 MHz phased array transducer together with corresponding annotated contours marking the lumen border and the media-adventitia interface. The dataset comprises images with calcified and non-calcified plaque as well as bifurcations, side branches and shadow artifacts. The annotated contours were transformed into segmentation masks containing three different classes (lumen, intima/media and adventitia/background). Figure 4 shows an example image with the corresponding segmentation mask.

Fréchet Inception distance
The Fréchet Inception distance (FID) [10] measures the distance between the generated image data distribution and the real image data distribution by combining mean values and covariance matrices of network activations arising from feeding both image sets into an Inception-v3 model [19], which was pre-trained on the ImageNet dataset [4]. Typically, activations of the penultimate network layer are used to calculate the FID score:

$$\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\left(C_1 + C_2 - 2\,(C_1 C_2)^{1/2}\right).$$

Here, $\mu_1$ and $\mu_2$ are the mean vectors and $C_1$ and $C_2$ the corresponding covariance matrices. Small FID scores and thus small distances between the image data distributions indicate visual similarity of the image sets as well as diversity of the generated image set, meaning that mode collapse was prevented. It has not been proven so far that low FID scores imply high image quality when applied to medical images. However, recent works indicate a correlation between FID score and realism of generated medical images [13,21].
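To make the computation concrete, the following sketch evaluates the Fréchet distance between two sets of activation vectors. Random vectors stand in for Inception-v3 activations, and the function name `frechet_distance` is hypothetical. The trace of the matrix square root is obtained from the eigenvalues of $C_1 C_2$, which is one common way to compute it and not necessarily what the reference implementation does.

```python
import numpy as np

def frechet_distance(act1, act2):
    """Frechet distance between two activation sets of shape (samples, features)."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    c1 = np.cov(act1, rowvar=False)
    c2 = np.cov(act2, rowvar=False)
    # Tr((C1 C2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C1 C2 (real and non-negative for covariance matrices).
    eigvals = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu1 - mu2) ** 2)
    return float(mean_term + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))   # stand-in for real-image activations
shifted = real + 2.0               # same covariance, mean shifted by 2 per feature

fid_same = frechet_distance(real, real)        # approximately 0
fid_shifted = frechet_distance(real, shifted)  # approximately 4 * 2**2 = 16
```

With identical covariances the trace term vanishes, so the score reduces to the squared distance between the mean vectors.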

Training
We used the non-saturating GAN loss functions proposed in [6]:

$$L_D = -\mathbb{E}_{\mathbf{x} \sim p_{data}}\left[\log D(\mathbf{x}, \mathbf{c})\right] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\left[\log\left(1 - D(G(\mathbf{z}, \mathbf{c}), \mathbf{c})\right)\right],$$

$$L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\left[\log D(G(\mathbf{z}, \mathbf{c}), \mathbf{c})\right],$$

where $L_D$ and $L_G$ denote the loss functions for discriminator and generator, respectively. Furthermore, $\mathbf{x}$ denotes a real image drawn from the data distribution $p_{data}$, whereas $\mathbf{c}$ denotes a condition. In this work, $\mathbf{c}$ is a segmentation mask. The random vector $\mathbf{z}$ is the input of the generator and is drawn from a standard multivariate Gaussian distribution $p_{\mathbf{z}}$. Finally, $D$ and $G$ are the discriminator and generator function, respectively.

For defining a baseline GAN, the speckle layer was replaced with an identity mapping (cyan-colored box in the generator sketch of Fig. 3). Everything else remained the same. SpeckleGAN and the baseline GAN were trained with 435, 200, 100 and 50 training examples, respectively. The validation during training was done by calculating the FID score between 435 generated images and the whole dataset of 435 real images to make all cases comparable (see "Segmentation evaluation" section for notes regarding overfitting). The GANs were conditioned with the segmentation masks of the dataset to generate synthetic images. This ensures that validation is not affected by artery morphologies, but focuses on textures.
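A minimal sketch of the non-saturating losses, evaluated on lists of discriminator outputs, is given below. The inputs are assumed to be probabilities in (0, 1) that the discriminator assigns to real images and to generated images for a given condition. The function names are hypothetical, and expectations are replaced by batch means.

```python
import math

def discriminator_loss(d_real, d_fake):
    """L_D = -mean(log D(real)) - mean(log(1 - D(fake)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating L_G = -mean(log D(fake))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = [0.5, 0.5]
l_d = discriminator_loss(undecided, undecided)  # 2 * log(2)
l_g = generator_loss(undecided)                 # log(2)
```

The generator loss decreases as the discriminator assigns higher probabilities to generated images, and its gradient does not vanish when the discriminator confidently rejects them, which is the motivation for the non-saturating form.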
For every combination of model and number of training examples, the best learning rate and learning rate decay scheme were determined individually by grid search. In summary, the initial learning rates ranged between 1e−3 and 3e−4 and were decreased to 1e−4 or 3e−5 in two steps every few hundred epochs. For optimization we used Adam with $\beta_1 = 0.5$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. During training, data augmentation was performed by random rotations as well as horizontal and vertical flips. The edge lengths of the square filter windows defining the speckle sizes in the speckle layer were initialized with values ranging from 28 to 48 pixels.

GAN evaluation
The final evaluation was done by calculating the FID score between 1000 generated images and all 435 real images. For generating the synthetic images, the generators were conditioned with artificial segmentation masks produced by superimposing randomly rotated and disturbed ellipses imitating artery lumen and intima/media layers. This approach simulates how GANs would be used in practice, namely to augment the dataset they were trained with. As explained in "Fréchet Inception distance" section, the FID score does not completely ensure reliability when used to evaluate realism of medical image sets. In order to further assess the quality of the synthetic images, we calculated two more metrics: the Jensen-Shannon divergence between gray value distributions of different segmentation classes in ground-truth and synthetic images and the structural similarity (SSIM) index between corresponding ground-truth and synthetic images.
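The Jensen-Shannon divergence between two gray value distributions can be sketched as follows. The number of histogram bins, the gray value range of 0 to 255, the base-2 logarithm and the helper names are assumptions for illustration. The toy gray values are invented and do not come from the dataset.

```python
import math

def histogram(values, bins=8, lo=0, hi=256):
    """Normalized gray value histogram (hypothetical helper)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence in bits, skipping empty bins of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and 1 for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy gray values of one segmentation class in a real and a synthetic image.
p_real = histogram([10, 20, 30, 40])
p_synth = histogram([200, 210, 220, 250])

js_same = js_divergence(p_real, p_real)       # 0.0 for identical distributions
js_disjoint = js_divergence(p_real, p_synth)  # 1.0 for non-overlapping histograms
```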

Segmentation evaluation
The generated IVUS images were used to improve segmentation performance of neural networks with U-Net architecture [16]. The networks consisted of residual blocks [9] in the down- and upsampling paths. In each of the three downsampling blocks, the spatial sizes of the feature maps were halved while the numbers of feature maps were doubled up to 256. The upsampling blocks operated vice versa. The input image dimensions were 256 × 256 and the batch size was 10.
To show that the use of synthetic IVUS data from SpeckleGAN improves segmentation performance when dealing with small datasets, we considered two scenarios in which only 50 (scenario 1) or only 100 (scenario 2) real training examples were available. To get representative performance statistics for the segmentation, we used the remaining examples from the whole dataset (385 for scenario 1 and 335 for scenario 2) as a test set. We used the training sets to train SpeckleGANs and baseline GANs for data augmentation (we did not use the GANs from "GAN evaluation" section).

Because of the small datasets in both scenarios, we used the whole training set as a reference set for monitoring the FID score during training and for finally choosing the model which is used to generate the synthetic images for segmentation pre-training. This means that the GANs will tend to overfit on the training set. However, when dealing with extremely small datasets, a further split would leave too little data to obtain useful results. Furthermore, it is not well studied so far how overfitting via FID scores quantitatively affects GAN performance. In the paper which introduces the FID score [10], the authors also use the training set as a reference set for calculating the FID score.

The best performing GANs each generated 1000 IVUS images by using synthetic segmentation masks as conditional inputs (compare "GAN evaluation" section). The segmentation networks were then pre-trained with the synthetic IVUS data and fine-tuned with the real training data. We used the Dice coefficient and the modified Hausdorff distance [5] to measure the segmentation performance via fivefold cross-validation. The modified Hausdorff distance allows meaningful evaluation of edge alignment for pixel mask-based segmentation results, because it is less sensitive to outliers. The final results were calculated on the remaining test sets.
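The two segmentation metrics can be illustrated on toy masks. The sketch assumes the common definition of the modified Hausdorff distance as the larger of the two mean nearest-neighbor distances between point sets. It is applied here to all foreground pixel coordinates, because the text does not specify how the point sets are extracted from the masks. The function names and the toy masks are hypothetical.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 2.0 * np.logical_and(pred, target).sum() / denom if denom else 1.0

def modified_hausdorff(points_a, points_b):
    """Larger of the two mean nearest-neighbor distances between point sets."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances
    d_ab = dist.min(axis=1).mean()           # mean distance from A to B
    d_ba = dist.min(axis=0).mean()           # mean distance from B to A
    return max(d_ab, d_ba)

# Two 4 x 4 squares on an 8 x 8 grid, shifted by one column.
a = np.zeros((8, 8), dtype=int)
a[2:6, 2:6] = 1
b = np.zeros((8, 8), dtype=int)
b[2:6, 3:7] = 1

dice_ab = dice(a, b)                                         # 2 * 12 / 32 = 0.75
mhd_ab = modified_hausdorff(np.argwhere(a), np.argwhere(b))  # 4 / 16 = 0.25
```

Because the distances are averaged instead of maximized over the points, a single outlier pixel changes the result only slightly, which is the robustness property mentioned above.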

Generation of synthetic IVUS images
The chart in Fig. 5 [...] Table 1 shows the GAN performances by means of Jensen-Shannon divergence and SSIM calculated between synthetic and real images. The results are broken down by the number of GAN training examples. Figure 6 gives an overview of generated IVUS images for varying numbers of training examples. In all cases, SpeckleGAN generates visually more appealing images than the baseline GAN. The quality of SpeckleGAN images only decreases slightly with fewer training examples, whereas the quality of images generated by the baseline GAN decreases strongly.

Table 2 shows the segmentation results of both scenarios described in "Segmentation evaluation" section with and without pre-training by means of synthetic images generated by SpeckleGAN and the baseline GAN. The upper table presents the Dice coefficients, whereas the lower table presents the modified Hausdorff distances. We performed t tests in a pairwise fashion to check if the means differ significantly. We note that p value correction for multiple hypothesis testing does not need to be applied in this setting, because we neither perform multiple tests on the same dataset nor test one and the same hypothesis on several datasets. The corresponding p values are given in the four rightmost columns. A low value (typically p < 0.05) indicates a significant difference in the calculated mean values of the underlying segmentation metrics.

Generation of synthetic IVUS images
Keeping in mind that the FID score measures the structural similarity of two image sets and their respective diversity, [...] Fig. 6). This can be explained by [...] [20] used 2075 images of the same clinical IVUS dataset (without segmentation masks) for training a two-stage GAN in order to generate synthetic images. Our approach results in Jensen-Shannon divergences which are one order of magnitude below the values achieved in [20], even with only 100 training examples. In particular, the values obtained for the adventitia layer are far superior, which shows that our approach results in speckle patches leading to gray value distributions resembling the real ones very closely. This could be due to the ability of our algorithm to produce speckles with various sizes over a single image. The baseline GAN also performs better than the approach in [20] regarding intima/media and adventitia layers when trained with 100 or more samples.
GANs often suffer from mode collapse [17]. This means that only a few or even only a single mode of the data distribution can be generated, which reduces the variety of the samples drastically. SpeckleGAN has the advantage that mode collapse can only affect the morphology (or background) of the image and not the speckle patterns, because these are randomly generated by the speckle layer.

IVUS segmentation
It has been demonstrated (see Table 2) that pre-training improves the mean Dice coefficient and the mean modified Hausdorff distance regardless of using synthetic images generated by SpeckleGAN or by the baseline GAN. But the improvements due to the baseline GAN are only statistically significant for 50 training examples, not for 100 training examples. In nearly all cases, pre-training with synthetic images of SpeckleGAN leads to better mean segmentation performances than pre-training with images from the baseline GAN. However, the improvement is not statistically significant in three cases of 50 training examples: for the Dice coefficient of the lumen as well as for the modified Hausdorff distance for both intima/media and lumen. It can be seen that pre-training with low quality images from the baseline GAN also improves the resulting Dice coefficients. This indicates that valuable information is even present in the morphology of blurred images.
The evaluation of the Jensen-Shannon divergence in Table 1 and the comparison with [20] show that the structure of the adventitia in particular benefits from SpeckleGAN. However, its appearance is only of minor importance for the segmentation of lumen and intima/media. The baseline GAN achieves much worse Jensen-Shannon divergences for the lumen. Nevertheless, for 50 training examples the lumen segmentation performance is equivalent or even better when pre-training with images of the baseline GAN. This leads to the conclusion that realistic speckle does not play an important role for segmentation of the lumen when dealing with 20 MHz IVUS images. Comparing [1,3,22] and the results of scenario 2, it can be seen that our approach nearly reached state-of-the-art performance, although our training set was smaller and no special effort was made to optimize the segmentation network used in this work (see "Segmentation evaluation" section).

Conclusion
SpeckleGAN improves quality and diversity of generated IVUS images compared to a baseline GAN model without a speckle layer. It generates visually appealing images with defined morphology (conditioned by segmentation masks) even when trained with extremely small datasets of 50 images. SpeckleGAN offers a wide range of possible applications. First of all, it is not limited to generating IVUS images. It could be applied to ultrasound images in general and to other imaging modalities that produce images with speckle, such as optical coherence tomography or radar. As seen in the previous section, realistic speckle patterns have only a minor impact on the performance when it comes to segmentation of lumen and intima/media layers in IVUS. Classification, detection or tracking tasks which heavily rely on speckle patterns could benefit much more from realistic speckles generated with SpeckleGAN when tackled with data-driven algorithms.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.