1 Introduction

1.1 Background and Motivation

Echocardiography (echo) is the most widely used method for evaluating the heart, because it is more cost-effective and safer than other imaging modalities, while providing high resolution images in real-time. However, a major drawback of echo is its heavy dependence on operators’ expertise to obtain high-quality images and associated anatomical and functional measurements.

Convolutional neural networks (CNNs) have shown great potential for automating medical image analysis tasks and are capable of accurately learning complex relevant features from large sets of data. A major challenge in the use of CNNs for medical imaging tasks is the need for labelling such large sets of data for model training. Further, CNN accuracy can be limited by the quality of the labels used during training, particularly in the presence of noise or other artefacts that can lead to inter-observer errors. Experienced cardiologists have been shown to have inter-observer errors up to 22% when labelling common measurements in echo [2].

Previous research has shown the feasibility of generating realistic natural and medical images [1, 5, 21]. Until recently, the state-of-the-art (SOTA) results were achieved with Generative Adversarial Networks (GANs) and CycleGANs. GANs are notoriously difficult to train: the generator and discriminator are optimized simultaneously, causing training instability, and the loss function can be highly non-convex, making the global minimum hard to find. Additionally, GANs can suffer from vanishing gradients, which can lead to slow or non-existent convergence. This often results in failed training runs and great difficulty in reproducing results. GANs are also prone to mode collapse, where the generator learns to produce a limited set of outputs that are repeated instead of diverse samples [4]. This can happen when the discriminator is too strong relative to the generator and rejects any samples that do not match the training data, forcing the generator to produce only a narrow set of outputs. When using a guide image for synthetic image synthesis (e.g., to generate a synthetic medical image guided by an anatomical representation), CycleGANs have been the technique of choice, but they typically fail to reproduce the anatomy under large transformations of the guide images and collapse to anatomy seen in the training set, as illustrated later in this paper.

Denoising Diffusion Probabilistic Models (DDPMs) are more recent generative models that are far less susceptible to the pitfalls of GAN-based methods [10]. However, limited research has applied them to medical images, with a few notable exceptions [9, 14]. No prior studies have explored their use for semantically guided medical image synthesis, or their application to ultrasound imaging at all. In this paper, we use DDPMs to generate synthetic echo images and train a segmentation model.

1.2 Related Works

The two primary Deep Learning (DL) techniques for generating synthetic images are GANs and DDPMs.

Generative Adversarial Networks: GANs and the CycleGAN subclass have shown success in generating synthetic medical images. Examples include Cycle-MedGAN [1], which achieved \(>0.91\) Structural Similarity Index Measure (SSIM) in an unsupervised Positron Emission Tomography (PET) to Computed Tomography (CT) translation task. The SOTA in segmentation was set by Gilbert et al., who used a CycleGAN architecture to generate images for training a network to segment an unseen real cardiac ultrasound test set. They achieved 79.4, 88.6 and 71.3% mean Dice scores on the left ventricle endocardium, left ventricle epicardium and left atrium respectively [6].

Diffusion Models: Sohl-Dickstein et al. initially proposed DDPMs [18], which have been used successfully for unsupervised image-to-image translation in both the natural and medical image domains. Pinaya et al. achieved a Fréchet inception distance (FID) of \(7.8 \times 10^{-3}\) for brain Magnetic Resonance Imaging (MRI) generation versus an FID of \(5 \times 10^{-4}\) for real brain MRI [14]. Lyu et al. achieved \(>0.85\) SSIM score for their conversion of CT to MRI images [9].

These works address the problem of image-to-image domain adaptation. Instead, we use diffusion models to address the issue of limited labelled training data in echo image segmentation, via a semantically guided network that receives an anatomical semantic label map and generates conditional synthetic echo images that adhere to that anatomy. Our work aims to combine the benefits of DDPMs with semantically guided medical image synthesis to synthetically increase dataset size and enable the creation of out-of-distribution images.

1.3 Contributions of This Study

This is the first work to utilize DDPMs for generating medical images conditioned on semantic label maps as the source image. To summarize our contributions, we: 1) demonstrate that a semantic label map guided diffusion model can be trained to synthesize cardiac ultrasound images and matching semantic labels, which can in turn be used to train a segmentation model that performs with high accuracy on real echo data; and 2) release the generated datasets, as well as the code, for public usage.

The code is available at https://github.com/david-stojanovski/echo_from_noise. The generated diffusion model dataset, as described in Sect. 2.1, is available at https://zenodo.org/record/7921055#.ZGYS_9LMLmE.

2 Methods

As an overview, our image synthesis pipeline implements the Semantic Diffusion Model (SDM) proposed by Wang et al. [20], the details of which are described in Sect. 2.1. Subsequently, these generated images are used to train and validate a segmentation model. This model is then tested on an unseen dataset of real echocardiographic images, as described in Sect. 2.2. An overview of the pipeline is given in Fig. 1.

Fig. 1.

Pipeline for generating synthetic ultrasound images from a semantic diffusion model (SDM), to be used in segmentation training and testing on real data. SDM Training: The SDM is trained to transform noise into a realistic image through an iterative denoising process. SDM Inference: The trained SDM is run on augmented label maps to generate corresponding realistic synthetic ultrasound images. Segmentation: The set of generated synthetic ultrasound images is used to train a segmentation network, whose performance is tested on real data. x denotes the semantic label map; \(y_t\) denotes the noisy image at each time step t.

2.1 Synthetic Ultrasound Generation: Semantic Diffusion Model

Data: We used the CAMUS echocardiography dataset [8], which contains 500 patients in total with semantically segmented left ventricle myocardium, endocardium, and left atrial surface for 4 chamber and 2 chamber images at both end-diastole (ED) and end-systole (ES) frames. The official test subset of 50 patients was reserved for testing the segmentation networks. We used the remaining 450 patients for training and validating the generative models, splitting them into 400 training and 50 validation patients. Note that only the ED frames were used for training and validating the two diffusion models (2C and 4C), i.e., 400 training and 50 validation frames for each model.

An additional label describing the ultrasound sector (see Fig. 1) was added to the label maps by applying a simple threshold to the ultrasound images.
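
A minimal sketch of this step, assuming pixels above an intensity threshold delineate the ultrasound cone; the threshold value, the sector class index, and the function name are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def add_sector_label(image: np.ndarray, label_map: np.ndarray,
                     threshold: float = 0.0, sector_class: int = 4) -> np.ndarray:
    """Add an ultrasound-sector class to a semantic label map by thresholding."""
    # Pixels inside the imaging cone are brighter than the (zero) background;
    # only unlabelled pixels are reassigned, so anatomical labels are preserved.
    sector = (image > threshold) & (label_map == 0)
    out = label_map.copy()
    out[sector] = sector_class
    return out
```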

Proposed Model: We made a few notable modifications to the Semantic Diffusion Model (SDM), incorporating some of the best practices proposed by Nichol et al. [11].

Firstly, we used a cosine noise schedule instead of a linear schedule, reducing the rate at which noise is added early on and thus increasing the contribution of the noise in later steps of the forward process. Secondly, the objective function being optimized is the sum of the noise-prediction loss, given an input label map at each time step, and the Kullback-Leibler (KL) divergence between the estimated distribution and the diffusion process posterior. We set the weight of the KL divergence term to 0.001 to stabilize the optimization.
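
For concreteness, a minimal sketch of the cosine schedule of Nichol et al. [11], which we substitute for the linear schedule:

```python
import numpy as np

def cosine_beta_schedule(num_steps: int, s: float = 0.008,
                         max_beta: float = 0.999) -> np.ndarray:
    """Cosine noise schedule from Nichol et al. [11]."""
    # alpha_bar follows a squared-cosine curve, so little noise is added in
    # early steps and the noise contribution grows in later forward steps.
    def alpha_bar(t: float) -> float:
        return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = [min(1.0 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps),
                 max_beta)
             for i in range(num_steps)]
    return np.array(betas)
```

The 0.001 weight on the KL term corresponds to the \(\lambda \) coefficient in the hybrid objective \(L_{simple} + \lambda L_{vlb}\) proposed in [11].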

During inference, we removed the classifier-free guidance sampling used for disentanglement, as it was found to make minimal perceptible difference to the generated images. This allowed us to approximately halve the inference time. Briefly, the predicted noise of the model given a semantic label map is represented by:

$$\begin{aligned} \hat{\epsilon }_{\theta }(y_t | x) = \epsilon _{\theta }(y_t | x) + s \cdot (\epsilon _{\theta }(y_t | x) - \epsilon _{\theta }(y_t | \emptyset )) \end{aligned}$$
(1)

where \(\hat{\epsilon }_{\theta }(y_t | x)\) is the total estimated noise in a ground truth image y at time step t given an input semantic label map x, s is a user-defined guidance scale, and \(\epsilon _{\theta }(y_t | \emptyset )\) is the estimated noise given a null input \(\emptyset \).
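
A minimal sketch of Eq. (1), and of where the time saving comes from: guided sampling requires two denoiser evaluations per step, whereas conditional sampling alone requires one. The `model` callable and its signature are hypothetical stand-ins for the SDM noise predictor.

```python
def guided_noise(model, y_t, t, x, null_map, s=1.5):
    """Classifier-free guidance, Eq. (1): two forward passes per step."""
    eps_cond = model(y_t, t, x)         # eps_theta(y_t | x)
    eps_null = model(y_t, t, null_map)  # eps_theta(y_t | null)
    return eps_cond + s * (eps_cond - eps_null)

def unguided_noise(model, y_t, t, x):
    """Our inference setting: one forward pass per step,
    roughly halving inference time."""
    return model(y_t, t, x)
```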

Two separate models were trained: one on 4 chamber (4C) end-diastolic frames and one on 2 chamber (2C) end-diastolic frames. Training and inference were performed using PyTorch 1.13 [13] on 8 \(\times \) Nvidia A100 graphics processing units for 50,000 steps with an annealing learning rate and a batch size of 12.
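
One way to set up the annealing learning rate over the 50,000 steps is sketched below; the base rate and the linear-to-zero anneal shape are our assumptions, not details from the paper.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for SDM parameters
optimizer = torch.optim.AdamW(params, lr=1e-4)  # base LR is an assumption

total_steps = 50_000  # as stated above
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_steps))

# Per training step: loss.backward(); optimizer.step(); scheduler.step()
```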

2.2 Segmentation

Data: From the 400+50 CAMUS patient images used to build the diffusion models, we took the semantic maps (4 per patient: 2C and 4C, both ED and ES) to produce four synthetic datasets: 2CED, 2CES, 4CED and 4CES. For each dataset, we took the 400+50 semantic maps and applied five random transformations (each a combination of a random affine and an elastic deformation), producing a total of 2000 training + 250 validation semantic maps per dataset. Affine transformation ranges for rotation, translation, scale and shear were \((-5, 5)^\circ \), (0, 0.05), (0.8, 1.05) and \(5^\circ \) respectively; the elastic deformation used (10, 10, 4) control points with maximum displacements of (0, 30, 30). A sketch of this augmentation is given below. The corresponding echo images were generated using the previously trained SDMs (the same SDM was used for ED and ES frames of a given view). A fifth dataset was built by aggregating the other four.
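
The following minimal sketch assumes torchvision-style transforms; note that torchvision's ElasticTransform is parameterized by alpha/sigma rather than the control-point grid and maximum displacement quoted above, so those two values are illustrative only.

```python
import torch
from torchvision import transforms as T

# Affine ranges as stated above; nearest-neighbour interpolation keeps the
# integer class labels intact.
affine = T.RandomAffine(
    degrees=(-5, 5), translate=(0, 0.05), scale=(0.8, 1.05), shear=5,
    interpolation=T.InterpolationMode.NEAREST,
)

# Illustrative elastic deformation; alpha/sigma do not map one-to-one onto
# the control points / max displacement reported in the text.
elastic = T.ElasticTransform(alpha=30.0, sigma=5.0,
                             interpolation=T.InterpolationMode.NEAREST)

label_map = torch.randint(0, 5, (1, 256, 256), dtype=torch.uint8)  # placeholder
variants = [elastic(affine(label_map)) for _ in range(5)]  # five variants per map
```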

The exact composition of the generated datasets is shown in Fig. 2. The results given for the segmentation tasks are the values from testing on the official test split of the CAMUS dataset.

Fig. 2.

Flow chart of semantic diffusion model dataset design. ED: end diastolic frames, ES: end systolic frames, 2-ch: echo 2 chamber images, 4-ch: echo 4 chamber images.

Model: We implemented an 8-layer U-Net model [16], identical to the one in [5] for fair comparison. Five instances of this model were trained, one with each of the 5 aforementioned datasets. Segmentation accuracy was assessed using the 2D Dice score [17]. The segmentation model was an in-house implementation of the standard U-Net using MONAI [3], adapted for multiple labels and an input image resolution of (256, 256). The network contains 8 layers (L) with \(2^L\) channels in each layer, and was trained for 300 epochs with the Adam optimizer, a learning rate of \(1 \times 10^{-3}\), \(\beta _1\) of 0.9 and \(\beta _2\) of 0.999. These hyperparameters were chosen based on best practices proposed in the literature [7].
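
A minimal sketch of this setup using MONAI's stock U-Net; the channel progression (\(2^L\) channels at layer L) and the output channel count are our reading of the text, and the in-house implementation may differ.

```python
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

model = UNet(
    spatial_dims=2,
    in_channels=1,                                 # greyscale echo input
    out_channels=5,                                # background + sector + 3 cardiac labels (assumption)
    channels=tuple(2 ** l for l in range(1, 9)),   # 8 layers with 2^L channels each
    strides=(2,) * 7,                              # one downsampling between consecutive levels
)

loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```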

2.3 Baseline CycleGAN Model and Data

The CycleGAN network from [5] was used as a benchmark for the SDM ultrasound image generation. The CycleGAN was trained with 2C and 4C slices from a public dataset of 1000 synthetic cardiac meshes [15]. The data were divided into a 70/15/15% train/validation/test split, and the network was trained for 200 epochs following the training procedure described in [19]. Dice score comparisons were made against the originally published results of Gilbert et al. [6].

3 Experiments and Results

The final Dice scores on the unseen CAMUS test set of 50 patients are shown in Table 1. Example images generated with the SDMs for all frames of a single patient are shown in Fig. 3. We also present extreme augmentations in Fig. 4 to illustrate the robustness of the SDM to inputs far outside the distribution of the training set. Figure 5 compares our SDM against a CycleGAN model for inference across a range of augmentations.

Table 1. Dice scores for the CAMUS test set. \(LV_{endo}\), \(LV_{epi}\) and LA denote left ventricular endocardium, epicardium and atrium respectively. Rows stating all frames show mean and standard deviation for each label from a network trained on all views. All frames refers to 2CH and 4CH images at both systole and diastole (4 images total).

Using the All SDM frames model pretrained on CAMUS and testing on the EchoNet-Dynamic apical 4 chamber end-diastole dataset [12], the model achieved a Dice score of \(87.83 \pm 7.21\%\), vs the \(85.6 \pm 7.0\%\) obtained by Gilbert et al. when training a dataset-specific CycleGAN model.

Qualitatively, our results suggest that our SDMs generate ultrasound images with superior overall image realism and adherence to the anatomical input constraints, as well as the ability to generate images from extreme out-of-distribution semantic label maps. Representative examples of generated images are shown in Figs. 3, 4 and 5.

Fig. 3.

Example images generated from our trained SDM networks. A) SDM synthetic images; B) SDM synthetic images overlaid with input semantic label map; C) Synthetic image with contour of input semantic label map. ED: End diastole; ES: End systole.

Fig. 4.

Example images generated from extreme semantic map distortions. Top row: Synthetic images; Bottom row: Synthetic images with semantic label map contours. Anatomical defects: a hand-crafted septal defect (left), and missing left atrium (right).

An example of our model vs the CycleGAN (previous SOTA) is shown in Fig. 5. Columns 2 and 3 show examples of the CycleGAN's limited ability to generate anatomically realistic images, even with a realistic anatomical guide image. The CycleGAN outputs show atrial walls merging, hallucinated septal wall thickening, or complete collapse of the ventricle.

Fig. 5.

Comparison of generated images from the SDM and our CycleGAN network. Top row: Generated synthetic images, bottom row: Overlay of the corresponding label map. Red dashed lines and arrows point to anatomical features missed or wrongly predicted by the CycleGAN model. Green dashed lines and arrows point to anatomical features correctly predicted by the SDM model. (Color figure online)

4 Discussion and Conclusions

Our anatomy-guided diffusion models generate realistic echo images that, compared to SOTA methods, adhere better to the anatomical constraints of a given label map, even when the prescribed anatomy is very different from the training set. Indeed, while CycleGANs can generate realistic ultrasound images, we observed that their ability to reproduce the anatomy under large deformations of the guide images is limited. Further, the anatomically accurate synthetic data generated with our model significantly improves the performance of the segmentation model on real images, showing the potential to address challenges involving rare medical conditions, data privacy, or limited data availability.

A segmentation network trained on the SDM-generated synthetic data significantly outperformed SOTA methods, and even a network trained on the original real data, in segmenting real 2 and 4 chamber ultrasound images. The LV endocardium, epicardium, and atrium were segmented with high accuracy (\(88.6 \pm 5.8\), \(91.9\pm 4.2\) and \(85.2 \pm 13.2\)% Dice scores respectively), representing improvements of 11.6, 3.7 and 19.5% over the previous SOTA results. Moreover, our results showed a reduced standard deviation for all labels, suggesting that our model yields realistic images more consistently, leading to less variation in segmentation performance. These results also highlight the adaptability of SDMs for generating new data.

Future work will include addressing variation across devices and clinical centers, and investigating the temporal aspects of synthetic data generation. The results presented in this paper show promise for synthetic data generation that can be used to train deep neural networks to high performance, addressing a crucial problem in medical imaging: the limited availability of expert-labelled data.