1 Introduction

Deep learning for ocular image analysis has become an essential academic field in ophthalmology [1, 2]. Recently, deep-learning-based image generation has received considerable attention [3]. In the medical field, image synthesis can address both personal information protection and the limited amount of pathological data [4]. Generated images can combine significant features of the training dataset, yet each one may differ from the original images. With the advent of generative adversarial networks (GANs), various advanced image processing methods, including segmentation, data augmentation, denoising, and domain transfer, have been applied in ophthalmic image domains [5]. Although many fundus photography (FP) datasets have recently been released to the public, data on some retinal diseases are still lacking [6]. Moreover, new image generation methods are being introduced to overcome the shortcomings of GANs, namely the difficulty of tuning parameters for stable training and of avoiding mode collapse [7].

The generative diffusion model has been highlighted as a state-of-the-art deep learning technique since the introduction of DALL-E 2 by OpenAI in April 2022 [8]. Recent well-known large-scale generative diffusion models, such as MidJourney and Copilot, have successfully synthesized high-quality images [9]. However, these popular models cannot generate retinal images because they are not trained on fundus images (Fig. 1). Even when the user directly attempts to create an image with prompts such as “retina,” “fundus photo,” and “optic nerve,” these models cannot render the blood vessels, optic nerve, or macula of a retinal image. Therefore, to generate realistic FP images with a generative diffusion model, individual domain-specific models must be developed.

Fig. 1

Images created with the prompts “Retina” and “Fundus photography” using popular diffusion-based generative models. A MidJourney. B Copilot. Images created by the authors may be used for non-commercial purposes in accordance with each company's policies (https://help.midjourney.com/en/articles/8150363-can-i-use-my-images-commercially and https://www.bing.com/new/termsofuseimagecreator#content-policy)

The basic diffusion architecture is the denoising diffusion probabilistic model (DDPM), which is trained over diffusion steps to add random noise to images and to learn to reverse this noising process to synthesize the desired images [10]. DDPM is a state-of-the-art architecture designed for unconditional image generation. It iteratively learns a Markov chain that gradually translates a simple distribution, such as Gaussian noise, into a desired data distribution. Several previous studies have suggested that diffusion models have advantages including stable training and flexibility of outcomes [11]. After image generation using the DDPM became popular, novel diffusion techniques were constantly developed in the machine learning community and medical fields [12]. One previous study proposed Medfusion, a DDPM-based model trained in a self-supervised auto-encoder fashion to reproduce its input image, and suggested that the DDPM outperformed other generative models, including GANs [13]. Medfusion learned to generate FP images from a large dataset of more than a hundred thousand images.
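As a concrete illustration, the forward (noising) part of this Markov chain has a closed form that allows an image to be noised directly to any time step. The NumPy sketch below uses a linear variance schedule as in the original DDPM paper [10]; all variable names are ours and the "image" is a random placeholder.

```python
import numpy as np

T = 1000                                 # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)       # linear variance schedule (Ho et al.)
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Sample x_t directly from x_0: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(128, 128))  # placeholder for a normalized image
x_mid = q_sample(x0, 500, rng)                # partially noised image
x_T = q_sample(x0, T - 1, rng)                # nearly pure Gaussian noise
```

By the final step almost no signal remains (`alphas_bar[T-1]` is roughly 4 × 10⁻⁵ with this schedule), which corresponds to the pure Gaussian noise domain \({x}_{T}\); the model then learns the reverse of this chain.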

In general, a very large amount of data helps guarantee the successful training of generative models. However, how the performance of the DDPM changes with a limited amount of FP data has not been studied. In this study, we investigated whether a generative diffusion model can synthesize high-quality FPs via domain-specific training on a small dataset in an unconditional manner (Fig. 2). Most previous studies of diffusion models in medical imaging have developed specific input-output transformation functions in the form of conditional GAN or single U-Net models. In this study, we intended to show that the DDPM can be used unconditionally to generate images by learning the distribution of the FP dataset itself.

Fig. 2

A schematic representation of the architecture of a denoising diffusion probabilistic model (DDPM) for fundus image generation. A The architecture of unconditional DDPM. B Training and image generation process. \({x}_{0}\) represents the fundus image domain and \({x}_{T}\) represents the pure Gaussian noise domain

To assess the feasibility of the DDPM for generating FP images, we performed a comparative experiment between DDPM and GAN models in the following sections. Section 2 describes the DDPM training and experimental methods. Section 3 describes the evaluation of the generated FP images and a comparison of the two unconditional image generation methods. Finally, Sect. 4 discusses the pros and cons of the two methods observed in our experiments.

2 Methods

2.1 Dataset

This study was based on a publicly accessible and deidentified FP image database, which was released by previous studies [14,15,16]. We collected healthy retinal images from each dataset and excluded pathological images. To reduce noise from differing borders, we selected FPs in which all boundaries of the circular mask were intact. Finally, 1000 healthy FPs were used to train the diffusion model, simulating a limited-data situation. The data collection protocol was approved by the Institutional Review Board of the Korea National Institute for Bioethics Policy (KoNIBP). The analysis using the deidentified FP image database was performed according to the guidelines of the KoNIBP. The Institutional Review Board of the KoNIBP waived the requirement for informed consent because the data were fully deidentified to protect patient confidentiality. All procedures were performed in accordance with the ethical standards of the 1964 Declaration of Helsinki and its later amendments. The organized dataset used in this study is available in the data repository (https://data.mendeley.com/datasets/fm4m8kr6cz).

2.2 Diffusion model development

Diffusion models match the data distribution of a training dataset by learning to reverse a gradual noising process. Diffusion models are at the beginning of their use in medical imaging fields, such as medical image reconstruction [17]. Currently, the DDPM is the most popular and basic diffusion model, inspired by considerations from nonequilibrium thermodynamics [10]. As shown in Fig. 2, we trained the DDPM on a U-Net backbone architecture, the most general form of the generative diffusion model [10]. We used the basic architecture with 64 hidden dimensions for each U-Net unit. In our DDPM model, the diffusion process was fixed to a Markov chain that gradually infuses Gaussian pixel noise into the image. After training, serial denoising U-Nets can generate FPs from random noise seeds. The input image size was set to a pixel resolution of 128 × 128, the maximum resolution at which training could be completed with our computational resources. We set the time step to the default value of 1000 because smaller values produced severely noisy or blurred synthetic images. We set the sampling time step to 250 and the loss function to L1 (mean absolute error). In the training process, we set the batch size to 16, the learning rate to 2 × 10⁻⁵, and the number of training steps to 10⁶. All codes for the implementation of the DDPM are available on the webpage (https://github.com/lucidrains/denoising-diffusion-pytorch). To increase the stability of training, we adopted standard augmentation techniques, including left and right flips, width and height translation from −10% to +10%, random rotation from −10° to 10°, and zooming from −10% to 10%. The deep learning models were developed using an NVIDIA RTX 2080Ti GPU with 4352 CUDA cores and 11 GB RAM.
For reproducibility, our modified codes, which can be implemented in Google Colaboratory, were released in the code repository (https://github.com/TaeKeunToo/Diffusion).
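As a hedged sketch, the configuration above roughly corresponds to the following usage of the denoising-diffusion-pytorch package cited above. Exact class and argument names may differ between package versions (e.g., how the L1 loss is selected), and the image folder path is a placeholder.

```python
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

# U-Net backbone with 64 hidden dimensions per unit, as in our experiments
model = Unet(dim=64)

diffusion = GaussianDiffusion(
    model,
    image_size=128,           # 128 x 128 pixel fundus photographs
    timesteps=1000,           # forward diffusion time steps
    sampling_timesteps=250,   # accelerated sampling time steps
)

trainer = Trainer(
    diffusion,
    'path/to/fundus_images',  # placeholder: folder of the 1000 healthy FPs
    train_batch_size=16,
    train_lr=2e-5,
    train_num_steps=1_000_000,
)
trainer.train()
```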

For a comparative investigation, we generated FP images using the DDPM and the progressive growing GAN (PGGAN) [4] under the same training conditions (https://github.com/tkarras/progressive_growing_of_gans). The PGGAN has been the most popular unconditional image synthesis technique, with robust and reliable outcomes [7]. All hyperparameters were set to their default values. The same augmentation techniques as in the DDPM training were used for the PGGAN. The image quality of 20 images each, generated using the PGGAN and the DDPM, was evaluated by two ophthalmologists (HKK and IHR). To avoid bias, we did not provide the ophthalmologists any prior information regarding the tested images. The ophthalmologists reviewed the synthetic images via e-mail, with no time limit, for their convenience. We asked them whether each image was properly generated as a good-quality FP.

We also generated FP images using the Medfusion (a publicly accessible DDPM-based auto-encoder) model trained with a large dataset (101,442 samples) to compare the quality of the synthetic images (Medfusion is publicly accessible at the following link: https://huggingface.co/spaces/mueller-franzes/medfusion-app) [13]. We used the Fréchet inception distance (FID) score to quantitatively evaluate the generated images [3, 18]. The image quality of 20 synthetic images from the Medfusion model was also evaluated by the two ophthalmologists, following the same protocol as above. We compared the FP images generated by our DDPM model to those from Medfusion.
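For reference, the FID reduces to the Fréchet distance between two Gaussians fitted to the Inception features of real and synthetic image sets. The minimal NumPy/SciPy sketch below shows the distance itself (feature extraction omitted; function name is ours).

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    covmean = linalg.sqrtm(cov1 @ cov2)   # matrix square root of the product
    if np.iscomplexobj(covmean):          # discard tiny imaginary residue
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Identical feature distributions give a distance of zero
mu, cov = np.zeros(4), np.eye(4)
d_same = frechet_distance(mu, cov, mu, cov)          # 0.0
d_shift = frechet_distance(mu, cov, mu + 1.0, cov)   # 4.0 (squared mean shift)
```

In practice, `mu` and `cov` are estimated from Inception-v3 activations of each image set, so a lower score indicates that the synthetic feature distribution is closer to the real one.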

3 Results

The DDPM successfully generated FP images with a resolution of 128 × 128 pixels using the 1000 training samples. Figure 3 shows that the DDPM required approximately one million training iterations, taking 250 h, to generate good synthetic images at this resolution; the training time depended on the batch size and computational capacity. Training sometimes failed because of incorrectly adjusted learning rates and local minima. We failed to train the DDPM for 256 × 256 pixel images owing to limited computational capacity. Figure 4 shows that the trained DDPM successfully generated synthetic FP images: random seed images were transformed into intermediate noisy pictures and, finally, into newly generated FPs.

Fig. 3

Results from the training process using the DDPM for the generation of fundus photographs with a resolution of 128 × 128 pixels

Fig. 4

The final synthetic images obtained through the denoising process based on our dataset

Figure 5 shows examples of the generated retinal images, which closely resemble real FPs. They showed diverse retinal structures, and the optic nerve, blood vessels, and black mask of the FP were well formed in the synthetic images. No mode collapse or grid artifacts occurred during image generation with the DDPM.

Fig. 5

Examples of synthetic retinal images generated by the denoising process of the trained diffusion model (DDPM) based on our dataset

Figure 6 shows example images generated by the fully trained PGGAN and DDPM. Better image quality was observed with the PGGAN, which synthesized sharper retinal vessels and optic discs than the DDPM (Fig. 6A). The quality ratings of the two experts were similar. The rates of good image quality graded by the two ophthalmologists were 75.0% (expert A) and 85.0% (expert B) for the PGGAN, and 35.0% (expert A) and 20.0% (expert B) for the DDPM (Fig. 6B). Both ophthalmologists commented that the major cause of the lower scores for the DDPM was the blurred anatomical features in the generated images. The quantitative evaluation results are shown in Table 1. The PGGAN (FID score: 41.761) achieved a better (lower) FID score than the DDPM (FID score: 65.605).

Fig. 6

A comparison of the synthetic images with a resolution of 128 × 128 pixels generated using the generative adversarial network (PGGAN) and the diffusion model (DDPM). A Example images used for image quality comparison. B Human assessment of image quality. Image quality of the 20 images each generated using the PGGAN and DDPM was evaluated by two ophthalmologists. The values indicate the rates (%) of images which were graded as good quality

Table 1 Quantitative evaluation of synthesized fundus photography images using FID

We also generated FP images using the Medfusion model, which was trained using a large FP dataset. The resulting FP images from the Medfusion model are shown in Fig. 7. The rates of good image quality were 85.0% (expert A) and 90.0% (expert B) for Medfusion. These scores were better than those of our DDPM trained using the small dataset (35.0% from expert A and 20.0% from expert B) (Table 2).

Fig. 7

Examples of synthetic retinal images generated by the Medfusion [13] model, trained with a large fundus photography dataset (101,442 images) in a self-supervised auto-encoder design

Table 2 A comparison of issues in the training process of the generative adversarial network (PGGAN) and the diffusion model (DDPM)

4 Discussion

In this study, we confirmed that the DDPM can generate FP images unconditionally by learning only the data distribution, without modifying the DDPM structure and without using a conditional design that requires supervised learning. Despite the development of deep learning, a generative artificial intelligence model that can freely generate high-quality FP images has not yet been developed. Generative AI can be a good solution for developing models to predict rare retinal diseases [19]. Given the current trend of technological development, a generative model capable of producing FP images of various diseases will eventually be released, and its core technology will likely be based on diffusion. Therefore, this study can be considered a first step toward generating high-quality FP images in the future.

We investigated the feasibility of the DDPM with a limited FP dataset for generating synthetic FP images. Deep learning has solved various image analysis problems, such as classification, segmentation, denoising, and data augmentation, in ophthalmic image domains [20,21,22]. Recently, the generative diffusion model has gained wide interest as a new class of AI technique for medical image synthesis [17]. Synthetic images can be used for educational purposes and data augmentation in clinical imaging. For example, a generative diffusion model successfully generated realistic brain MRI images [23]. This technique was also used on histopathologic images for an image domain transfer task [11]. A large pretrained generative diffusion model could generate synthetic chest X-ray images after a fine-tuning process [24]. Diffusion has also been introduced as a new image-translation algorithm that can provide high-quality denoised images [12]. We developed a domain-specific generative diffusion model based on the DDPM to generate FP images using a small dataset. However, our findings show that this approach is still in the early stages of development. To our knowledge, this is the first study to use a domain-specific generative diffusion model to synthesize FPs using a limited dataset.

The FP is a standardized imaging domain, with a black frame and consistent locations of the optic nerve and macula, which may allow new images to be generated from a small amount of data. Medical images from other, non-standardized domains may lack even a thousand training samples. When we compared image quality between our DDPM (trained on a small dataset) and the Medfusion model (trained on a large dataset), we observed that the DDPM generally needs a larger and more diverse dataset to generate high-quality images. Although our DDPM model was trained using more standardized FP images (healthy eyes), the larger training dataset (Medfusion) provided better image synthesis performance via the DDPM. In a previous study using a large dataset, the DDPM showed better performance than a GAN (StyleGAN) in generating FP images [13]. However, our study showed that the performance of the DDPM fell below that of the GAN model (PGGAN) on this small dataset.

The DDPM successfully synthesized relatively small FP images with a resolution of 128 × 128 pixels; however, it had several limitations compared with the PGGAN. Although the GAN requires some hyperparameter tuning steps for stable training, the limitations of the DDPM shown in Table 2 are critical. The training process of the DDPM requires more computational resources than that of the PGGAN. During our experiments, we found that general data augmentation techniques were effective in improving the image quality of both the DDPM and the PGGAN. The images generated by the DDPM showed more blurred features than those generated by the PGGAN, although the DDPM was trained over a longer period. However, the PGGAN often produced noticeable checkerboard artifacts caused by deconvolution. During the trial-and-error steps, we failed to train the DDPM with a small amount of data and a small batch size owing to unstable gradients. Both the DDPM and the PGGAN required large amounts of data for stable training. The DDPM has been shown to be excellent at removing noise from images [25], but its image generation ability still needs to be verified.

Our study has limitations. Owing to limited computational capacity, we could not generate fundus images under more diverse experimental conditions. The quality of image synthesis may have decreased because the training dataset was not large; the small size of the dataset was another limitation of this study. The training failure at a larger resolution (256 × 256 pixels) may have been caused by the small amount of data. Leveraging distributed computing resources may be a solution for training DDPM models at larger resolutions. In addition, it was difficult to objectively evaluate the quality and diversity of the synthetic images. The findings of this study are preliminary, so further research is needed. Nevertheless, this study demonstrates the possibility of image generation using a diffusion model with a small FP dataset.

According to our observations, in the generation of synthetic FPs, it seems difficult for diffusion models to surpass the performance of GAN-based models on a limited dataset within the computational capabilities currently available to general researchers. Although diffusion models can overcome shortcomings of GANs, such as mode collapse and vanishing gradients [26], the DDPM was difficult to train, and the quality of its resulting images was lower than that obtained using the GAN. A previous study reported that a diffusion model outperformed a GAN in an image synthesis task [27], but our study did not reproduce this result. Another study showed that a diffusion model was better than a GAN in an image translation problem in medical pathology [11]. Further studies are required to improve the training speed of diffusion models and the quality of the synthesized images before these models can be used for domain-specific medical imaging research. We believe that a domain-specific, hyper-large diffusion model may be another solution for obtaining high-quality synthetic medical images. GANs also showed low performance at first but, with successive improvements, now achieve good image generation performance with relatively small datasets.

5 Conclusion

A domain-specific generative diffusion model, in the form of an unconditional DDPM, can synthesize FPs from a limited dataset, but further study is needed to improve image quality. We found that the DDPM could generate synthetic FPs without mode collapse or grid artifacts. Unlike experimental results with large datasets, our study showed that the FP image generation performance of the DDPM was lower than that of the unconditional GAN model (PGGAN) on this small dataset. We hope that our early experience will help medical researchers conduct further studies using diffusion models. In the future, additional research should aim to generate a variety of high-quality FP images based on more data and a large diffusion model.