1 Introduction

1.1 Background and Motivation

Echocardiography (echo) is the most widely used method for evaluating the heart, because it is more cost-effective and safer than other imaging modalities, while providing high resolution images in real-time. However, a major drawback of echo is its heavy dependence on operators’ expertise to obtain high-quality images and associated anatomical and functional measurements.

Convolutional neural networks (CNNs) have shown great potential for automating medical image analysis tasks and are capable of accurately learning complex relevant features from large sets of data. A major challenge in the use of CNNs for medical imaging tasks is the need for labelling such large sets of data for model training. Further, CNN accuracy can be limited by the quality of the labels used during training, particularly in the presence of noise or other artefacts that can lead to inter-observer errors. Experienced cardiologists have been shown to have inter-observer errors up to 22% when labelling common measurements in echo [2].

Previous research has shown the feasibility of generating realistic natural and medical images [1, 5, 21]. Until recently, the state-of-the-art (SOTA) results were achieved with Generative Adversarial Networks (GANs) and CycleGANs. GANs are notoriously difficult to train: the generator and discriminator are optimized simultaneously, causing training instability, and the loss function can be highly non-convex, making the global minimum hard to find. Additionally, GANs can suffer from vanishing gradients, which can lead to slow or non-existent convergence. This often results in failed training runs and great difficulty in reproducing results. GANs are also prone to mode collapse, where the generator learns to produce a limited set of outputs that are repeated instead of diverse samples [4]. This can happen when the discriminator is too strong relative to the generator and rejects any samples that do not match the training data, forcing the generator to produce only a narrow set of outputs. When using a guide image for synthetic image synthesis (e.g., to generate a synthetic medical image guided by an anatomical representation), CycleGANs have been the technique of choice, but they typically fail to reproduce the anatomy under large transformations of the guide images and collapse to anatomy seen in the training set, as illustrated later in this paper.

Denoising Diffusion Probabilistic Models (DDPMs) are more recent generative models that are far less susceptible to the pitfalls of GAN-based methods [10]. However, limited research has applied them to medical images, with a few notable exceptions [9, 14]. No prior studies have explored their use for semantically guided medical image synthesis, or their application to ultrasound imaging at all. In this paper, we use DDPMs to generate synthetic echo images and train a segmentation model.

1.2 Related Works

The two primary Deep Learning (DL) techniques for generating synthetic images are GANs and DDPMs.

Generative Adversarial Networks: GANs and the CycleGAN subclass have shown success in generating synthetic medical images. Examples include Cycle-MedGAN [1], which achieved \(>0.91\) Structural Similarity Index Measure (SSIM) in an unsupervised Positron Emission Tomography (PET) to Computed Tomography (CT) translation task. The SOTA in segmentation was set by Gilbert et al., who used a CycleGAN architecture to generate images for training a network to segment an unseen real cardiac ultrasound test set. They achieved 79.4, 88.6 and 71.3% mean Dice scores on the left ventricle endocardium, left ventricle epicardium and left atrium respectively [6].

Diffusion Models: Sohl-Dickstein et al. initially proposed DDPMs [18], which have been used successfully for unsupervised image-to-image translation in both the natural and medical image domains. Pinaya et al. achieved a Fréchet inception distance (FID) of \(7.8 \times 10^{-3}\) for brain Magnetic Resonance Imaging (MRI) generation versus an FID of \(5 \times 10^{-4}\) for real brain MRI [14]. Lyu et al. achieved \(>0.85\) SSIM score for their conversion of CT to MRI images [9].

These works address the problem of image-to-image domain adaptation. Instead, we use diffusion models to address the issue of limited labelled training data in echo image segmentation, via a semantically guided network that receives an anatomical semantic label map and generates conditional synthetic echo images that adhere to that anatomy. Our work aims to combine the benefits of DDPMs with semantically guided medical image synthesis to synthetically increase dataset size and enable the creation of out-of-distribution images.

1.3 Contributions of This Study

This is the first work to utilize DDPMs for generating medical images conditioned on semantic label maps as the source image. To summarize our contributions, we: 1) demonstrate that a semantic label map guided diffusion model can be trained to synthesize cardiac ultrasound images and matching semantic labels, which can in turn be used to train a segmentation model that performs with high accuracy on real echo data; and 2) release the generated datasets, as well as the code, for public usage.

The code is available at https://github.com/david-stojanovski/echo_from_noise. The generated diffusion model dataset, as described in Sect. 2.1, is available at https://zenodo.org/record/7921055#.ZGYS_9LMLmE.

2 Methods

As an overview, our image synthesis pipeline implements the Semantic Diffusion Model (SDM) proposed by Wang et al. [20], the details of which are described in Sect. 2.1. Subsequently, these generated images are used to train and validate a segmentation model. This model is then tested on an unseen dataset of real echocardiographic images, as described in Sect. 2.2. An overview of the pipeline is given in Fig. 1.

Fig. 1.

Pipeline for generating synthetic ultrasound images from a semantic diffusion model (SDM), to be used in segmentation training and testing on real data. SDM Training: The SDM is trained to transform noise into a realistic image through an iterative denoising process. SDM Inference: The trained SDM is run on augmented label maps to generate corresponding realistic synthetic ultrasound images. Segmentation: The set of generated synthetic ultrasound images is used to train a segmentation network, whose performance is tested on real data. x denotes the semantic label map; \(y_t\) denotes the noisy image at each time step t.

2.1 Synthetic Ultrasound Generation: Semantic Diffusion Model

Data: We used the CAMUS echocardiography dataset [8], which contains 500 patients in total with semantically segmented left ventricle myocardium, endocardium, and left atrial surface for 4 chamber and 2 chamber images at both end-diastole (ED) and end-systole (ES) frames. The official test subset of 50 patients was reserved for testing the segmentation networks. We used the remaining 450 patients for training and validating the generative models, splitting them into 400 training and 50 validation patients. Note that only the ED frames were used for training and validating the two diffusion models (2C and 4C), i.e., 400 training and 50 validation frames for each model.

An additional label describing the ultrasound sector (see Fig. 1) was added to the label maps by applying a simple threshold to the ultrasound images.
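
A minimal sketch of this step, assuming pixels above an intensity threshold delineate the ultrasound cone; the threshold value, the sector class index, and the function name are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def add_sector_label(image: np.ndarray, label_map: np.ndarray,
                     threshold: float = 0.0, sector_class: int = 4) -> np.ndarray:
    """Add an ultrasound-sector class to a semantic label map by thresholding."""
    # Pixels inside the imaging cone are brighter than the (zero) background;
    # only unlabelled pixels are reassigned, so anatomical labels are preserved.
    sector = (image > threshold) & (label_map == 0)
    out = label_map.copy()
    out[sector] = sector_class
    return out
```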

Proposed Model: We made a few notable modifications to the Semantic Diffusion Model (SDM), incorporating some of the best practices proposed by Nichol et al. [11].

Firstly, we used a cosine noise schedule instead of a linear schedule, reducing the rate at which noise is added early on and thus increasing the contribution of the noise in later steps of the forward process. Secondly, the objective function being optimized is the sum of the noise-prediction loss, given an input label map at each time step, and the Kullback-Leibler (KL) divergence between the estimated distribution and the diffusion process posterior. We set the weight of the KL divergence term to 0.001 to stabilize the optimization.
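
For concreteness, a minimal sketch of the cosine schedule of Nichol et al. [11], which we substitute for the linear schedule:

```python
import numpy as np

def cosine_beta_schedule(num_steps: int, s: float = 0.008,
                         max_beta: float = 0.999) -> np.ndarray:
    """Cosine noise schedule from Nichol et al. [11]."""
    # alpha_bar follows a squared-cosine curve, so little noise is added in
    # early steps and the noise contribution grows in later forward steps.
    def alpha_bar(t: float) -> float:
        return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = [min(1.0 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps),
                 max_beta)
             for i in range(num_steps)]
    return np.array(betas)
```

The 0.001 weight on the KL term corresponds to the \(\lambda \) coefficient in the hybrid objective \(L_{simple} + \lambda L_{vlb}\) proposed in [11].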

During inference, we removed the classifier-free guidance sampling used for disentanglement, as it was found to make minimal perceptible difference to the generated images. This allowed us to approximately halve the inference time. Briefly, the predicted noise of the model given a semantic label map is represented by:

$$\begin{aligned} \hat{\epsilon }_{\theta }(y_t | x) = \epsilon _{\theta }(y_t | x) + s \cdot (\epsilon _{\theta }(y_t | x) - \epsilon _{\theta }(y_t | \emptyset )) \end{aligned}$$
(1)

where \(\hat{\epsilon }_{\theta }(y_t | x)\) is the total estimated noise in a ground truth image y at time step t given an input semantic label map x, s is a user-defined guidance scale, and \(\epsilon _{\theta }(y_t | \emptyset )\) is the estimated noise given a null input \(\emptyset \).
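
A minimal sketch of Eq. (1), and of where the time saving comes from: guided sampling requires two denoiser evaluations per step, whereas conditional sampling alone requires one. The `model` callable and its signature are hypothetical stand-ins for the SDM noise predictor.

```python
def guided_noise(model, y_t, t, x, null_map, s=1.5):
    """Classifier-free guidance, Eq. (1): two forward passes per step."""
    eps_cond = model(y_t, t, x)         # eps_theta(y_t | x)
    eps_null = model(y_t, t, null_map)  # eps_theta(y_t | null)
    return eps_cond + s * (eps_cond - eps_null)

def unguided_noise(model, y_t, t, x):
    """Our inference setting: one forward pass per step,
    roughly halving inference time."""
    return model(y_t, t, x)
```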

Two separate models were trained: one on 4 chamber (4C) end-diastolic frames and one on 2 chamber (2C) end-diastolic frames. Training and inference were performed using PyTorch 1.13 [13] on 8 \(\times \) Nvidia A100 graphics processing units for 50,000 steps with an annealing learning rate and a batch size of 12.
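
One way to set up the annealing learning rate over the 50,000 steps is sketched below; the base rate and the linear-to-zero anneal shape are our assumptions, not details from the paper.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for SDM parameters
optimizer = torch.optim.AdamW(params, lr=1e-4)  # base LR is an assumption

total_steps = 50_000  # as stated above
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_steps))

# Per training step: loss.backward(); optimizer.step(); scheduler.step()
```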

2.2 Segmentation

Data: From the 400+50 CAMUS patient images used to build the diffusion models, we took the semantic maps (4 per patient: 2C and 4C, both ED and ES) to produce four synthetic datasets: 2CED, 2CES, 4CED and 4CES. For each dataset, we took the 400+50 semantic maps and applied five random transformations (each a combination of a random affine and an elastic deformation), producing a total of 2000 training + 250 validation semantic maps per dataset. Affine transformation ranges for rotation, translation, scale and shear were \((-5, 5)^\circ \), (0, 0.05), (0.8, 1.05) and \(5^\circ \) respectively; the elastic deformation used (10, 10, 4) control points with maximum displacements of (0, 30, 30). A sketch of this augmentation is given below. The corresponding echo images were generated using the previously trained SDMs (the same SDM was used for ED and ES frames of a given view). A fifth dataset was built by aggregating the other four.
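
The following minimal sketch assumes torchvision-style transforms; note that torchvision's ElasticTransform is parameterized by alpha/sigma rather than the control-point grid and maximum displacement quoted above, so those two values are illustrative only.

```python
import torch
from torchvision import transforms as T

# Affine ranges as stated above; nearest-neighbour interpolation keeps the
# integer class labels intact.
affine = T.RandomAffine(
    degrees=(-5, 5), translate=(0, 0.05), scale=(0.8, 1.05), shear=5,
    interpolation=T.InterpolationMode.NEAREST,
)

# Illustrative elastic deformation; alpha/sigma do not map one-to-one onto
# the control points / max displacement reported in the text.
elastic = T.ElasticTransform(alpha=30.0, sigma=5.0,
                             interpolation=T.InterpolationMode.NEAREST)

label_map = torch.randint(0, 5, (1, 256, 256), dtype=torch.uint8)  # placeholder
variants = [elastic(affine(label_map)) for _ in range(5)]  # five variants per map
```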

The exact composition of the generated datasets is shown in Fig. 2. The results given for the segmentation tasks are the values from testing on the official test split of the CAMUS dataset.

Fig. 2.

Flow chart of semantic diffusion model dataset design. ED: end diastolic frames, ES: end systolic frames, 2-ch: echo 2 chamber images, 4-ch: echo 4 chamber images.

Model: We implemented an 8-layer U-Net model [16], identical to the one in [5] for fair comparison. Five instances of this model were trained, one with each of the 5 aforementioned datasets. Segmentation accuracy was assessed using the 2D Dice score [17]. The segmentation model was an in-house implementation of the standard U-Net using MONAI [3], adapted for multiple labels and an input image resolution of (256, 256). The network contains 8 layers (L) with \(2^L\) channels in each layer, and was trained for 300 epochs with the Adam optimizer, a learning rate of \(1 \times 10^{-3}\), \(\beta _1\) of 0.9 and \(\beta _2\) of 0.999. These hyperparameters were chosen based on best practices proposed in the literature [7].
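
A minimal sketch of this setup using MONAI's stock U-Net; the channel progression (\(2^L\) channels at layer L) and the output channel count are our reading of the text, and the in-house implementation may differ.

```python
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

model = UNet(
    spatial_dims=2,
    in_channels=1,                                 # greyscale echo input
    out_channels=5,                                # background + sector + 3 cardiac labels (assumption)
    channels=tuple(2 ** l for l in range(1, 9)),   # 8 layers with 2^L channels each
    strides=(2,) * 7,                              # one downsampling between consecutive levels
)

loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```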

2.3 Baseline CycleGAN Model and Data

The CycleGAN network from [5] was used as a benchmark for the SDM ultrasound image generation. The CycleGAN was trained with 2C and 4C slices from a public dataset of 1000 synthetic cardiac meshes [15]. The data were divided into a 70/15/15% train/validation/test split, and the network was trained for 200 epochs following the training procedure described in [19]. Dice score comparisons were made against the originally published results of Gilbert et al. [6].

3 Experiments and Results

The final Dice scores on the unseen CAMUS test set of 50 patients are shown in Table 1. Example images generated with the SDMs for all frames of a single patient are shown in Fig. 3. We also present extreme augmentations in Fig. 4 to illustrate the robustness of the SDM to inputs far outside the distribution of the training set. Figure 5 compares our SDM against a CycleGAN model for inference across a range of augmentations.

Table 1. Dice scores for the CAMUS test set. \(LV_{endo}\), \(LV_{epi}\) and LA denote left ventricular endocardium, epicardium and atrium respectively. Rows stating all frames show mean and standard deviation for each label from a network trained on all views. All frames refers to 2CH and 4CH images at both systole and diastole (4 images total).

Using the All SDM frames model pretrained on CAMUS and testing on the EchoNet-Dynamic apical 4 chamber end-diastole dataset [12], the model achieved a Dice score of \(87.83 \pm 7.21\%\), vs the \(85.6 \pm 7.0\%\) obtained by Gilbert et al. when training a dataset-specific CycleGAN model.

Qualitatively, our results suggest that our SDMs generate ultrasound images with superior overall image realism and adherence to the anatomical input constraints, as well as the ability to generate images from extreme out-of-distribution semantic label maps. Representative examples of generated images are shown in Figs. 3, 4 and 5.

Fig. 3.

Example images generated from our trained SDM networks. A) SDM synthetic images; B) SDM synthetic images overlaid with input semantic label map; C) Synthetic image with contour of input semantic label map. ED: End diastole; ES: End systole.

Fig. 4.

Example images generated from extreme semantic map distortions. Top row: Synthetic images; Bottom row: Synthetic images with semantic label map contours. Anatomical defects: a hand-crafted septal defect (left), and missing left atrium (right).

An example of our model vs the CycleGAN (previous SOTA) is shown in Fig. 5. Columns 2 and 3 show examples of the CycleGAN's limited ability to generate anatomically realistic images, even with a realistic anatomical guide image. The CycleGAN outputs show atrial walls merging, hallucinated septal wall thickening, or complete collapse of the ventricle.

Fig. 5.

Comparison of generated images from the SDM and our CycleGAN network. Top row: Generated synthetic images, bottom row: Overlay of the corresponding label map. Red dashed lines and arrows point to anatomical features missed or wrongly predicted by the CycleGAN model. Green dashed lines and arrows point to anatomical features correctly predicted by the SDM model. (Color figure online)

4 Discussion and Conclusions

Our anatomy-guided diffusion models generate realistic echo images that, compared to SOTA methods, adhere better to the anatomical constraints of a given label map, even when the prescribed anatomy is very different from the training set. Indeed, while CycleGANs can generate realistic ultrasound images, we observed that their ability to reproduce the anatomy under large deformations of the guide images is limited. Further, the anatomically accurate synthetic data generated with our model significantly improves the performance of the segmentation model on real images, showing the potential to address challenges involving rare medical conditions, data privacy, or limited data availability.

A segmentation network trained on the SDM-generated synthetic data significantly outperformed SOTA methods, and even a network trained on the original real data, in segmenting real 2 and 4 chamber ultrasound images. The LV endocardium, epicardium, and atrium were segmented with high accuracy (\(88.6 \pm 5.8\), \(91.9\pm 4.2\) and \(85.2 \pm 13.2\)% Dice scores respectively), representing improvements of 11.6, 3.7 and 19.5% over the previous SOTA results. Moreover, our results showed a reduced standard deviation for all labels, suggesting that our model yields realistic images more consistently, leading to less variation in segmentation performance. These results also highlight the adaptability of SDMs for generating new data.

Future work will include addressing variation across devices and clinical centers, and investigating the temporal aspects of synthetic data generation. The results presented in this paper show promise for synthetic data generation that can be used to train deep neural networks to high performance, addressing a crucial problem in medical imaging: the limited availability of expert-labelled data.