1 Introduction

Deep Learning (DL) has exhibited exceptional performance in medical image analysis tasks in recent years [1], for instance in classifying lung and colon cancer [2], predicting cervical cancer prognosis [2], improving hepatocellular carcinoma fatality prognosis [3], and grading invasive ductal carcinoma of the breast [4]. However, all DL models require extensive datasets with diverse images and accurate annotations. Acquiring and annotating fundus images is time-consuming, tedious, expert-dependent, and expensive. Therefore, researchers opt to use generative models such as Generative Adversarial Networks (GAN) [5] and Variational Autoencoders (VAE) [6] to artificially generate new synthetic images. These models form a distinct area of DL research and have demonstrated immense potential for integrating medical domain knowledge into DL models [7].

A GAN consists of two neural networks, a generator (G) and a discriminator (D), trained simultaneously in a min–max game: G learns to map noise (z) to realistic images that follow the distribution of the real images and can deceive D, while D is a binary classifier that distinguishes real (1) from fake (0) samples. The two networks improve until they reach a convergence point (Nash equilibrium), where neither can improve further.
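As an illustration of this min–max game, the following minimal sketch (written in TensorFlow/Keras purely for exposition, not taken from any cited implementation) shows one adversarial training step in which D is updated to separate real from fake samples and G is updated to fool D:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def gan_train_step(generator, discriminator, g_opt, d_opt, real_images, batch_size, z_dim=100):
    z = tf.random.normal([batch_size, z_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(z, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # D: real -> 1, fake -> 0; G: make D output 1 for its fakes.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return g_loss, d_loss
```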

A VAE consists of an encoder and a decoder neural network. The encoder maps an input image (x) into a latent representation (z), while the decoder reconstructs from the latent representation an image that should be as close as possible to the input. GAN and VAE have been combined to enhance the vanilla VAE [8], and both combined and single models have been used to address the shortage of medical images. However, these models have limitations, as shown in Fig. 1 and discussed here: 1) Synthetic images lack vessel abundance and exhibit extreme tortuosity, with vessels emerging from nowhere [9]. 2) Vessels are gathered in two main arcades and weaken as they extend further [10]. 3) Inability to produce extremely thin vessels [11]. 4) Missing or duplicated optic disc in the synthetic images [10, 11]. 5) The produced images include hazy disc boundaries [10, 12, 13]. 6) Limited number of generated/synthesized images lacking anatomical characteristics [9, 10]. 7) Synthetic image quality inferior to genuine images [9,10,11,12,13].

Fig. 1
figure 1

Samples of synthetic images by [10, 11] and [9] ordered from left to right, respectively

Models in literature are categorized into two groups based on their architectural design. The first group utilizes two-stage pipelines that combine VAE-GAN or GAN-GAN architectures for generating fundus images, as seen in [9,10,11, 13]. The second group uses a single GAN architecture to synthesize retinal images, as demonstrated in [14,15,16,17]. Despite their exceptional performance, there is still scope for improvement, as mentioned earlier. The primary reasons behind these limitations are:

First, to generate high-quality synthetic images with complex structures like vessels, macula, and optic disc, [18] recommends using the same number of iterations (t) for both G and D, with t > 4. Increasing t to train D more than G can also improve synthetic image quality, as observed in [18]. A competitive D is crucial for a robust GAN model, as emphasized by [19]. [17] proposes a different training strategy, with G using double the iterations of D to balance the learning flow. In contrast, the vanilla GAN [5] trains D for double the iterations of G during learning. Balancing the learning process between the two players is critical and can lead to collapse if not handled properly, as mentioned in [20].

Balancing the training process becomes even more challenging in two-stage pipelines, where two models learn simultaneously and the second model depends entirely on the output of the first [16]. If the first model produces low-quality images, the performance of the second model suffers, reducing the performance of the entire pipeline. [10, 21] depend on the segmentation efficiency of the first model, which leads to visual artifacts when vessel structures are under-segmented.

Second, VAEs produce blurry images during reconstruction [13]. Recent studies such as [10, 22, 23] suggested a hybrid VAE-GAN method, with the GAN discriminator replacing the VAE's decoder to improve the loss function calculation. However, the images produced by the modified VAE may still exhibit blurriness and dotted structures (Fig. 2), requiring pre-processing before being fed into the GAN architecture.

Fig. 2
figure 2

The blurriness and dotted structures in images generated using the hybrid VAE-GAN methods; real images at the top and reconstructed images below

Third, some studies [11, 21] rely on pre-existing images for model training, whereas the generator should learn from a regularized latent space to produce an unlimited number of synthesized images, independent of a specific image count.

Fourth, using a single GAN model, as in [15, 16], is challenging because a GAN's latent space is more difficult to control than a VAE's. Additionally, GANs lack the continuity properties required for certain applications [24].

Lastly, some existing methods [11, 13, 17, 18] use a single mask for synthesizing images, causing the models to prioritize the mask and ignore other fundoscopic characteristics. This leads to blurry shapes, unclear optic disc boundaries, and a lack of anatomical detail in the resulting images.

This paper aims to combine VAE and GAN architectures to address the limitations of the individual generative methods. The advantage of combining the two architectures lies in the capability of VAEs to represent images in a latent space with large variability, and in the capability of GANs to produce sharp images with high resolution and good perceptual quality [25, 26]. These two strengths are employed in a complementary manner to address the blurred, low-quality images generated by VAE architectures [13], the difficulty GANs have in capturing the full data distribution [27], and the complexity of controlling the latent space for better generality [28]. Therefore, in this work we propose multiple VAEs to synthesize the vessel structure and optic disc separately, fuse the generated masks and manipulate them with the SVV function before feeding them to an image-to-image translation network [29]; lastly, we compare the synthesized pairs with real pairs using a GAN architecture.

The significant contributions of this work are summarized as follows:

  • A new SVV layer assists in sharpening and varying the vessels' abundance in the reconstructed tubular structure.

  • To the best of our knowledge, this work is the first to synthesize fundus images while controlling the complexity of the vessel structure.

  • Multiple landmarks are involved in the synthesis process.

  • Extensive experiments demonstrate the effectiveness of the proposed method, including visual assessment, qualitative evaluation using the SSIM similarity index, and quantitative evaluation using downstream segmentation and classification tasks.

The paper is organized as follows: Sect. 2 provides a review of related literature, Sect. 3 details the proposed method, and Sect. 4 explains the environment setup, datasets, framework architecture, training strategy, and evaluation metrics. Section 5 presents the qualitative and quantitative evaluations, and Sect. 6 discusses the study's limitations.

2 Related Literature

This section focuses on the generative models used to synthesize retinal images and discusses their architectures, datasets, and evaluation metrics, in addition to the pros and cons of their methodologies.

[30] proposed a vessel generation method producing high-quality images, validated qualitatively by experts' visual perception scores on 12 synthetic retinal fundus images and quantitatively using the VAMPIRE segmentation algorithm with 10 synthetic and real images from the HRF dataset. They were able to generate retinal images without employing deep learning methods; however, complex computations were required due to the large image resolution. Furthermore, because the proposed method lacks a convolutional backbone, the model has limited capability to extract deep and complex features, and as a result the quality of the synthetic images is inferior to realistic images.

[17] proposed the MI-GAN framework for two tasks: synthesizing retinal vessels from only a few samples and segmenting real/fake images. They modified the generator objective by replacing the L1 function with cross-entropy loss plus the sum of style loss, content loss, and total variation loss. They validated their method by incorporating different discriminators (i.e., patch GAN and image GAN) in a segmentation task. Their experiments on the DRIVE and STARE datasets showed that their method outperformed existing methods and surpassed expert ability. Although they were able to produce an unlimited number of synthetic images from the same input using a small training set (only tens of images), their method increases the rate of false negatives at vessel edges and endpoints, as it tends to assign low probabilities to pixels within uncertain areas.

Similarly, [15] developed Tub-sGAN, a GAN framework that synthesizes multiple images from a single binary vessel segmentation input using style transfer (including style loss, content loss, and total variation loss). Tub-sGAN can learn from small training sets of fewer than ten images and was trained on four datasets (DRIVE, STARE, HRF, and NeuB1). Downstream segmentation tasks and SSIM [31] were used to validate the synthetic images. Although their synthetic images excel in preserving proper connectivity of the vessel trees, and their model can generate different outputs from the same tubular structured annotations, it is relatively weak at synthesizing local details, such as exudate regions. Furthermore, some anatomical details are less than perfect: the boundaries of the optic disc are often not as clear as those of the real images, and the macular region is sometimes not entirely accurate.

[16] proposed a glaucoma assessment method using a retinal image synthesizer and semi-supervised learning with DCGAN [32]. Their method was trained on a small glaucoma-labelled dataset and a large unlabeled dataset comprising 86,926 cropped retinal images from 14 datasets. To validate their approach, they performed a quantitative evaluation by comparing the pixel proportions of optic disc and vessel network structures in real and synthetic images, as well as the 2D histogram and mean squared error between the two. This work is the first to use a semi-supervised learning method and a retinal image synthesizer to generate an unlimited number of glaucoma-labeled images. Their method can generate images synthetically and provide the labels automatically. Although the number of retinal images used during training is significantly greater than in any other work in the literature, they were unable to generate synthetic images better than DCGAN [32] or Costa's method [10].

[21] introduced a novel image synthesis method based on image-to-image translation [29] and adversarial learning. They used a U-net architecture to extract a binary vessel tree from the actual fundus image and trained a pix2pix network on image pairs to map the binary vessel map to a retinal image using global L1 and GAN adversarial losses. The Messidor-DB1 dataset was used for training, and evaluation metrics included Qv [33] and Image Structure Clustering (ISC) [34] scores. The synthetic images exhibit noticeable diversity in prominent visual characteristics, such as colour, tone, and illumination. However, the primary drawback of their method is its reliance on an existing vessel tree to generate a new image. Additionally, if the vessel tree is obtained through a segmentation technique applied to the original image, any weaknesses inherent in the segmentation algorithm are carried over to the synthesized image.

Moreover, [10] developed an end-to-end retinal image synthesizer using Adversarial AE and GAN architecture. The model can generate fundus images by sampling the latent space from a probability distribution. The model was trained on the Messidor-1 dataset and validated visually and quantitatively using segmentation models and specific evaluation metrics, including Mutual Information (MI) to measure information overlap between real and synthetic blood vessels, and the Image Structure Clustering (ISC) metric to assess relevant retinal anatomical structures. Despite the ability of their method to generate realistic synthetic images that significantly deviate from the examples in the training set, with smooth variations in color and texture, and accurately placed optic discs, the resulting images still exhibit artifacts like broken tubular structures and chessboard patterns.

[9] proposed a two-stage pipeline method using DCGAN and pix2pix GAN architectures to synthesize vasculature and retinal images. The first architecture was trained on the DRIVE dataset to generate vasculature images from noise, while the second architecture used the Messidor-1 dataset to generate corresponding retinal images. Synthetic images were validated using a U-Net segmentation model trained on real images from DRIVE and on synthetic images, with the F1-score used to assess segmentation results. The Kullback–Leibler divergence (KL) score was also used to show that the synthetic images were distinct from the actual images and that the model did not memorize the training data. Their method produced large quantities of images that are publicly available for use in data-driven machine learning tasks. However, their synthetic images exhibit artifacts, extremely tortuous vessels, missing optic discs or maculae, hazy optic discs with unclear boundaries, and vessels that emerge from nowhere.

Other studies, such as [35], have worked on synthesizing digital camera noise to generate realistic images. This approach is based on a conditional GAN training scheme that uses style loss to supervise generator training and injects Gaussian noise into each decoder block. The approach looks attractive; however, it has not yet been adopted in ophthalmology to generate retinal images. Furthermore, [36] presented a latent diffusion model that synthesizes high-resolution images quickly and efficiently by applying the diffusion model in the latent space of powerful pre-trained autoencoders. This approach preserves image details and reduces training time complexity. Notably, it has not been exploited in retinal image generation.

In conclusion, there is no specific evaluation metric for synthesized retinal images, but most studies evaluate them through downstream segmentation tasks and specific metrics. Other methods, such as similarity measurements or expert assessment, can also be used. Figure 3 shows a taxonomy of evaluation methods used in related literature and specifies the evaluation metrics followed in this study.

Fig. 3
figure 3

A taxonomy of image quality assessment methods used in the literature. Entries in red are review/survey studies that recommend these methods, while entries in blue are studies that applied them. The green ticks indicate the methods used in this work

3 Method

We propose a multi-stage pipeline for retinal image synthesis. The framework consists of two VAE architectures and a GAN architecture to generate realistic fundus images with vessel tree and optic disc masks. The synthetic images were evaluated qualitatively and quantitatively to demonstrate the usefulness of the proposed method. The framework architecture, presented in Fig. 4, is divided into three parts: blood vessel and optic disc synthesis, masks-to-retinal-image translation, and latent space to retinal image synthesis.

Fig. 4
figure 4

The proposed framework consists of three generative models. VAEBV and VAEOD (purple rectangles) generate the vessel tree and optic disc masks, respectively. The pix2pix GAN architecture (blue rectangle) produces the synthetic image, and the SVV layer (red rectangle) sharpens the generated vessels and controls their diversity

3.1 Blood Vessel and Optic Disc Synthesis

Generating realistic vessel trees and optic disc masks in addition to the synthesized fundus image is a fundamental aspect of an end-to-end fundus image synthesis system. In this section, the VAEs generate unlimited blood vessels and optic disc masks with plausible anatomical structures and high variability.

The VAE architecture encodes training images into a latent representation, z ∼ Q(x) = q(z|x), using an encoder network (Q) and a decoder network (P). The encoder network's parameters (θ1) are trained to produce a distribution Q(x, θ1) from which a latent variable z can be sampled. The decoder network's parameters (θ2) are trained to reconstruct the latent variable back to an image, ẍ ∼ P(z) = p(x|z), belonging to the input data distribution. However, to overcome the difficulty VAEs have in generating new samples close to the real ones, caused by the uncontrollable latent representation space, we applied a modification proposed by [37], which combines VAE with GAN by using the VAE encoder as the generator component in the adversarial game. This better regularizes the generative model and enforces the generator network to follow the pre-specified prior distribution when producing latent representations. The discriminator is trained to distinguish whether a sample comes from the latent representation or from the true normal distribution. Figure 5 shows the modified VAE architecture.
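The following sketch illustrates this idea under illustrative layer sizes (the 32-dimensional latent space matches the value reported later in Sect. 4.2); it is not the released implementation. The encoder outputs the mean and log-variance of q(z|x), a reparameterized sample is drawn, and a small discriminator judges whether a latent vector comes from the encoder or from the prior:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 32  # 32-dimensional latent space, as reported for the framework

def build_encoder(img_shape=(256, 256, 1)):
    x_in = layers.Input(shape=img_shape)
    h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x_in)
    h = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    mu = layers.Dense(LATENT_DIM)(h)          # mean of q(z|x)
    log_var = layers.Dense(LATENT_DIM)(h)     # log-variance of q(z|x)
    return Model(x_in, [mu, log_var], name="encoder")

def build_latent_discriminator():
    z_in = layers.Input(shape=(LATENT_DIM,))
    h = layers.Dense(64)(z_in)
    h = layers.LeakyReLU()(h)
    out = layers.Dense(1, activation="sigmoid")(h)  # prior sample vs. encoder code
    return Model(z_in, out, name="latent_discriminator")

def sample_z(mu, log_var):
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps   # reparameterization trick
```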

Fig. 5
figure 5

Combination of VAE and GAN to generate vessel tree and optic disc

Therefore, inspired by the original GAN objective in Eq. (1), the modified GAN equation replaces G with the encoder q(z|x) of the VAE, as in Eqs. (2) and (3):

$$\min_{G}\max_{D} V(D,G) = E_{x\sim p_{data}(x)}\left[\log D(x)\right] + E_{z\sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
(1)
$$L_{BV}(D_{BV}, q) = E_{z\sim p(z)}\left[\log D_{BV}(z)\right] + E_{x\sim p_{data}(x)}\left[\log\left(1 - D_{BV}(q(z|x))\right)\right]$$
(2)
$$L_{OD}(D_{OD}, q) = E_{z\sim p(z)}\left[\log D_{OD}(z)\right] + E_{x\sim p_{data}(x)}\left[\log\left(1 - D_{OD}(q(z|x))\right)\right]$$
(3)

Both q and p weights are updated to minimize the reconstruction error while maximizing the error rate of D. By adding the reconstruction loss and the Kullback–Leibler divergence loss (L_KL) specified in Eq. (4) to Eqs. (2) and (3), the final loss function of the adversarial autoencoder VAE_BV/VAE_OD that controls the learning process becomes a combination of \({L}_{BV}/{L}_{OD}\), the reconstruction loss, and \({L}_{KL}\), as follows:

$$L_{KL} = D_{KL}\left(q(z|x)\,\|\,p(z)\right),$$
(4)
$$L_{VAE\_BV}(D_{BV}, q, p) = L_{BV}(D_{BV}, q) + \yen\, L_{Rec}(q, p) + L_{KL},$$
(5)
$$L_{VAE\_OD}(D_{OD}, q, p) = L_{OD}(D_{OD}, q) + \yen\, L_{Rec}(q, p) + L_{KL},$$
(6)

As in the work by [10], \(\yen\) is set to 100 to balance the two loss terms. The training mechanism mimics the min–max game of the vanilla GAN: both \(q\) and \(p\) aim to minimize the overall loss, while DBV/DOD maximize it. Once Nash equilibrium is achieved in Eqs. (5) and (6), the decoder (\(p\)) can synthesize a vessel tree/optic disc mask from the latent distribution.
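A hedged sketch of the per-branch objective in Eqs. (2)–(6) is given below; the function and argument names, and the L1 choice for the reconstruction term, are illustrative assumptions, while the reconstruction weight of 100 follows the text above:

```python
import tensorflow as tf

RECON_WEIGHT = 100.0  # the weight applied to the reconstruction term in Eqs. (5)-(6)

def vae_branch_loss(x, x_recon, mu, log_var, d_prior_score, d_code_score, eps=1e-8):
    # Adversarial term of Eqs. (2)/(3): D_BV / D_OD scores for prior samples
    # and for encoder codes q(z|x); D maximizes it, encoder/decoder minimize it.
    adversarial = tf.reduce_mean(tf.math.log(d_prior_score + eps)) + \
                  tf.reduce_mean(tf.math.log(1.0 - d_code_score + eps))
    # Reconstruction term L_Rec (L1 chosen here for illustration).
    reconstruction = tf.reduce_mean(tf.abs(x - x_recon))
    # L_KL of Eq. (4), pushing q(z|x) toward the standard normal prior.
    kl = -0.5 * tf.reduce_mean(1.0 + log_var - tf.square(mu) - tf.exp(log_var))
    return adversarial + RECON_WEIGHT * reconstruction + kl
```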

For sharper and cleaner synthesized vessels, an additional layer named sharpening and varying vessels (SVV) is placed right after the modified VAEBV. The SVV layer sharpens blurred pixels and varies the abundance of vessels in the reconstructed vessel tree; Fig. 6 shows the SVV layer attached to VAEBV. The SVV layer includes a \((9\times 9)\) filter that sharpens the input image through a convolution, retaining the highest pixel values and enhancing spatial resolution by emphasizing the boundaries of each pixel. The sharpening is followed by batch normalization and a sigmoid activation function. The resulting image is then passed through a lambda function, in which a factor Ұ is randomly assigned at each iteration to regulate the number of generated vessels and introduce variability in the output, as follows:

$$\ddot{x}(i,j)=\begin{cases}\ddot{x}(i,j), & \text{if } \ddot{x}(i,j) \ge \text{Ұ}, \\ 0, & \text{otherwise},\end{cases}$$
(7)

where Ұ is randomly set between 0.2 and 0.4.

Fig. 6
figure 6

Sharpening and varying layer attached to the VAEBV

In this equation, ẍ is the generated vessel tree, (i, j) are the pixel coordinates, and Ұ controls the complexity of the generated vessel structure; its value is randomly drawn between 0.2 and 0.4. As shown in Fig. 11, if Ұ is set to 0.2, fewer vessels are generated, while if Ұ is set to 0.4, too many vessels are generated. The optimal Ұ value for achieving a realistic appearance of the vessel structures is 0.3. The threshold value of Ұ therefore controls the diversity of the vessel structures: a low threshold lets only a few vessel pixels pass, resulting in a sparse vessel structure, whereas a high threshold lets a larger number of vessel pixels pass, resulting in a more abundant vessel structure.
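The following sketch illustrates the SVV idea of sharpening followed by random thresholding (Eq. 7); the kernel size and layer ordering follow the description above, while the learned convolution weights and exact composition are assumptions rather than the released implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def svv_block(x, y_min=0.2, y_max=0.4):
    """x: reconstructed vessel mask of shape (batch, H, W, 1), values in [0, 1]."""
    # 9x9 sharpening convolution followed by batch normalization and a sigmoid,
    # as described above (a learned kernel is used here for illustration).
    x = layers.Conv2D(1, kernel_size=9, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("sigmoid")(x)

    # Lambda thresholding of Eq. (7): a random factor drawn in [y_min, y_max]
    # keeps strong vessel pixels and zeroes the rest, varying vessel abundance.
    def threshold(t):
        y = tf.random.uniform([], y_min, y_max)
        return tf.where(t >= y, t, tf.zeros_like(t))

    return layers.Lambda(threshold)(x)
```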

To evaluate the effectiveness of the proposed SVV layer, a visual comparison was performed with various image processing techniques (see Fig. 7). Considering that a single layer within a model typically performs less effectively than a dedicated deep learning model designed specifically for image enhancement, we compared the proposed layer with traditional sharpening techniques rather than with other deep learning-based methods that focus primarily on image sharpening. The primary goal of this study is to generate realistic fundus images, not to enhance noisy or blurred images.

Fig. 7
figure 7

Performance of SVV layer for sharpening vessels compared to other sharpening filters. SE stands for Sharpening Estimation value

Furthermore, we utilized the sharpness estimation equation proposed by [38] to evaluate the sharpness of the images. As depicted in Fig. 7, the reconstructed vessels display some blurriness and fuzzy texture, primarily attributed to the inherent limitations of the VAE architecture. Taking the score of the reconstructed vessels as the reference for sharpness estimation, we observed that the SVV layer yielded the closest score to the reference images while maintaining the appearance and continuity of the vessels. Although other sharpening filters produced higher estimation scores than the SVV layer, they often resulted in corrupted vessel structures or influenced pixel contrast, forming halos around the vessels.

3.2 Masks to Retinal Image Translation

This section involves training the model to generate a retinal image from existing vessel tree and optic disc masks. This is done through an image-to-image translation process based on the method proposed in [29]. The approach involves two adversarial neural networks that emulate the GAN competition. G is trained to map the merged vessel tree and optic disc masks (x) to a new representation (r) while maximizing the misclassification error to deceive D. D, in turn, aims to distinguish real from generated images (Eq. 1). The adversarial loss is then formulated accordingly.

$$L_{adv}(G, D) = E_{x,r\sim p_{data}(x,r)}\left[\log D(x,r)\right] + E_{x\sim p_{data}(x)}\left[\log\left(1 - D(x, G(x))\right)\right]$$
(8)

The term "pdata(x)" refers to the distribution of real vessel trees, while "Ex,r∼pdata(x,r)" denotes the expectation over pairs (x, r) that are sampled from the joint data distribution "pdata(v, r)".

Recent research [14, 29] has shown that combining the global L1 loss function with Eq. (8) results in visually sharp images. The adversarial loss penalizes smoothed regions and promotes sharp images, while the L1 loss preserves global consistency. To achieve this, Eq. (8) is extended with the L1 loss as follows:

$$L_{Pix2Pix}(G, D) = L_{adv}(G, D) + \delta\, E_{x,r\sim p_{data}(x,r)}\left[\left\| r - G(x)\right\|_{1}\right],$$
(9)

The variable "δ" balances the two losses and is set to 100, as in the original paper by [10]. The G aims to maintain global regularity and consistency in visual features with the help of the L1 loss function. Meanwhile, the D is trained to differentiate between real and generated N x N image regions. The image-to-image translation problem is a part of the overall framework, and its architecture is depicted in Fig. 8.

Fig. 8
figure 8

G converts the vessel tree to a colored retinal image, while D identifies synthetic and real pairs

3.3 Latent Space to Retinal Image Synthesis

This section combines the autoencoder models with the image-to-image translation model to create the proposed framework, which generates a retinal image with vessel tree and optic disc masks from a random sample. To produce a realistic retinal image, balancing the training of these models is critical. We train all models simultaneously and sum their loss functions. The adversarial loss of the GAN, defined in Eq. (8), is used with a fake image input generated by the autoencoders instead of the usual GAN setting. As all the models' tasks are interconnected, they act together as the G component and are trained to deceive D by generating a plausible retinal image (r) that implicitly contains a plausible vessel tree and optic disc, as follows:

$$\acute{L}_{adv}(G, D) = E_{x,r\sim p_{data}(x,r)}\left[\log D(x,r)\right] + E_{x\sim p_{data}(x)}\left[\log\left(1 - D(x, p(q(v)))\right)\right]$$
(10)

Then the modified loss function of the image-to-image translation model defined in Eq. (9) is updated as follows:

$$\acute{L}_{Pix2Pix}(G, D) = \acute{L}_{adv}(G, D) + \delta\, E_{x,r\sim p_{data}(x,r)}\left[\left\| r - p(q(v))\right\|_{1}\right]$$
(11)

Lastly, the modified loss functions in Eqs. (10) and (11) are combined with the VAE loss functions defined in Eqs. (5) and (6) to form the global loss of the entire framework, as follows:

$$L_{Global}(G, D, D_{BV}, q_{BV}, p_{BV}, D_{OD}, q_{OD}, p_{OD}) = \acute{L}_{Pix2Pix}(G, D) + L_{VAE\_BV}(D_{BV}, q, p) + L_{VAE\_OD}(D_{OD}, q, p)$$
(12)

In this equation, D, DBV, and DOD aim to maximize the loss, while G, qBV, pBV, qOD, and pOD aim to minimize it. Joint training helps D improve VAEBV and VAEOD, and G benefits both VAEs by producing a realistic fundus image with the SVV layer, which maximizes D's classification error. Figure 9 depicts the model combination.

Fig. 9
figure 9

The combined models and the SVV layer. VAEBV, VAEOD, and the GAN aim to minimize the loss between (x, r) and their outputs while maximizing the classification error of D, DBV, and DOD with the SVV layer. DBV/DOD distinguish p(z) samples from the encoders' latent representations, while D distinguishes synthetic from real pairs

4 Implementation and Training

4.1 Dataset

The proposed framework was trained on the publicly available Messidor-1 dataset [39], which contains 1200 fundus images with four grades of diabetic retinopathy. As there is no ground truth for the blood vessels, a U-net model trained on the DRIVE dataset [40] was used to extract the vessel trees. 254 images from Messidor-1 were excluded due to advanced diabetic retinopathy. The remaining 946 retinal images were downscaled to 256 × 256 and randomly split into training (614), validation (155), and testing (177) sets.
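A minimal sketch of this preparation step is shown below; the file layout and random seed are placeholders, while the 256 × 256 resizing and the 614/155/177 split follow the description above:

```python
import glob
import random
import numpy as np
from PIL import Image

paths = sorted(glob.glob("messidor1_filtered/*.png"))  # the 946 retained images (placeholder path)
random.seed(42)                                         # illustrative seed
random.shuffle(paths)

def load(path):
    return np.asarray(Image.open(path).resize((256, 256)), dtype=np.float32) / 255.0

train = [load(p) for p in paths[:614]]
val   = [load(p) for p in paths[614:769]]   # 155 validation images
test  = [load(p) for p in paths[769:]]      # 177 test images
```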

4.2 Models' Architecture

The proposed framework has the same architecture as [10], with eight blocks in the encoders of both VAEBV and VAEOD. Each block has two convolutional layers with the same kernel size and different strides, except for the last block, which has only one convolutional layer. Figure 10 shows the block architecture for both VAEs. Dropouts with 0.5 were used in the 5th, 6th, and 7th layers after every activation function. Each encoder outputs two fully connected layers for mean μ(x) and standard deviation σ(x) with 32 units. The decoder has the same architecture as the encoder but with upsampling layers and a fully connected input layer to receive the encoder's output.
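One encoder block can be sketched as follows; the kernel size, filter counts, and activation are assumptions made for illustration, while the two-convolutions-per-block design, the stride difference, and the 0.5 dropout follow the description above:

```python
from tensorflow.keras import layers

def encoder_block(x, filters, use_dropout=False, kernel_size=3):
    # Two convolutions with the same kernel size but different strides,
    # with 0.5 dropout after the activations in the blocks where it applies.
    x = layers.Conv2D(filters, kernel_size, strides=1, padding="same")(x)
    x = layers.LeakyReLU()(x)
    if use_dropout:
        x = layers.Dropout(0.5)(x)
    x = layers.Conv2D(filters, kernel_size, strides=2, padding="same")(x)
    x = layers.LeakyReLU()(x)
    if use_dropout:
        x = layers.Dropout(0.5)(x)
    return x
```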

Fig. 10
figure 10

Encoder and decoder architectures for each VAE. Letters ‘C’ and ‘Up’ denote convolutional and upsampling processes, respectively

The GAN architecture in this work is based on [29], which uses a U-net with 8 blocks for both the encoder and decoder. Each block includes a (3 × 3) convolutional layer, followed by batch normalization and LeakyReLU activation. Dropouts were used in the first three blocks, and a sigmoid activation function was used in the final block. The discriminator D has the same architecture as the G encoder and is used to classify 16 × 16 patches. The DBV/DOD consists of two fully connected layers with LeakyReLU and sigmoid activation functions, respectively.

Properly balancing the training of the GAN and VAE architectures is crucial. Poor tuning may result in noisy images lacking complex structures and in a discriminator unable to learn distinguishable features. Additionally, the discriminator could become stronger than the generator and remain unaffected by the small changes made to synthetic images. The typical training method for the vanilla GAN [5] did not produce satisfactory results in our case, as it was intended for generating digit images that lack the intricacies present in our images, such as vessels, optic disc, and macula. Therefore, we employed a distinct training approach for our models, presented in Table 1.

Table 1 Presents the training strategy of the proposed model

After the model is trained, neither an input vessel mask nor a disc mask is needed as a prior requirement to produce an image; the advantage of an end-to-end framework is that it generates a complete retinal image from a few specific features. In our case, the trained VAEs reconstruct the vessel structure and optic disc features from a sample point randomly picked from the regularized latent space. The reconstructed masks are then sharpened and fused, and converted into a complete retinal image through image-to-image translation performed by the generator (G) of the GAN architecture, while the discriminator (D) distinguishes between real and artificially created images.
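The inference procedure described above can be summarized by the following sketch, in which the model handles are placeholders and the mask-fusion operator is an assumption:

```python
import numpy as np

def synthesize_fundus(decoder_bv, decoder_od, svv_layer, translator, latent_dim=32):
    """Generate one synthetic fundus image from random latent samples."""
    z_bv = np.random.normal(size=(1, latent_dim)).astype("float32")
    z_od = np.random.normal(size=(1, latent_dim)).astype("float32")
    vessel_mask = np.asarray(svv_layer(decoder_bv(z_bv)))  # VAE_BV decoder + SVV sharpening
    disc_mask = np.asarray(decoder_od(z_od))               # VAE_OD decoder
    fused = np.maximum(vessel_mask, disc_mask)             # mask fusion (assumption)
    return translator(fused)                               # pix2pix-style translation to a fundus image
```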

4.3 Evaluation Metrics

For each application, a suitable quality measure should be employed [41,42,43,44]. In line with the previous studies discussed in the literature and the suggestions of [45], we used the Structural SIMilarity index (SSIM) [31] to evaluate the quality of the synthesized images, as it is commonly used in medical image applications [14]. SSIM is a perceptual metric that measures the loss in image quality due to processing by comparing two images of the same scene. A higher SSIM value (close to 1) indicates greater similarity between real and synthesized images. Additionally, we evaluated downstream tasks such as segmentation and classification, as recommended by [15, 50]. The classifier's precision, recall, and F1-score were calculated from the reported false positives (FP), false negatives (FN), and true positives (TP), using the following equations:

$$\mathrm{Precision }= \frac{TP}{TP + FP},$$
(13)
$$\mathrm{Recall }= \frac{TP}{TP + FN},$$
(14)
$$\mathrm{F1\text{-}score} = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
(15)

Furthermore, histograms are used to evaluate the similarity between the real and artificial datasets, and the KL-divergence score is calculated to estimate the difference between them.
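The classification metrics of Eqs. (13)–(15) and the histogram-based KL-divergence score can be computed as in the following sketch, which assumes pixel intensities normalized to [0, 1]:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def histogram_kl(real_images, synth_images, bins=256, eps=1e-10):
    # Pixel-intensity histograms of the two image sets, compared with KL divergence.
    p, _ = np.histogram(np.concatenate([im.ravel() for im in real_images]),
                        bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(np.concatenate([im.ravel() for im in synth_images]),
                        bins=bins, range=(0.0, 1.0))
    p = p.astype(np.float64) + eps
    q = q.astype(np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```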

5 Qualitative and Quantitative Evaluation

To ensure fair assessment, real and synthetic images were randomly chosen for both quantitative and qualitative experiments. To match the size of the synthesized images, real images were downscaled to 256 × 256, as the model generates this resolution. Low retinal image resolution can still achieve state-of-the-art performance, as shown in [46] for diabetic retinopathy classification. All evaluation experiments were conducted on the Python 3.7.13 platform, using the open-source Keras library (version 2.1.1) and TensorFlow-gpu (version 1.15.0). The experiments were performed on an MS Windows 11 environment running on an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz (12 CPUs), 16 GB RAM, an NVIDIA GeForce RTX 2060-16GB GPU, and CUDA Toolkit 10.0.130.

For the training parameters, the model was trained with a learning rate of 0.0002 using the Adam optimizer with a beta_1 value of 0.5. The training iteration was set to 500, and the batch size was 20. Additionally, the dimension of the latent space was configured to 32.

5.1 Visual Image Evaluation

In this section, we present synthesized images generated by our model with their corresponding vessel tree and optic disc for visual evaluation (see Fig. 11). The retinal contents were correctly placed within the Field of View, and the coloring and illumination were visually acceptable, indicating the model's robustness. The SVV layer in our proposed model improved the generator network's capability to synthesize retinal images efficiently with varying vessel abundance, as demonstrated in Fig. 11.

Fig. 11
figure 11

Synthesized images and vessels’ masks with different Ұ values

Note that all images in Fig. 11 were generated without the need for pre-existing vessel trees or fundus images, demonstrating the model's ability to produce an unlimited number of unique images. To ensure the model did not memorize the training images, we analyzed the distance between the synthetic and training images using the visual information fidelity (VIF) method proposed by [47]. This method calculates mutual information obtained by the human visual system (HVS) channel for both training and synthetic images separately for later comparison, which is widely used in medical image registration [48]. The VIF analysis confirmed that our model can generate realistic images that are visually different from the training set, indicating its excellent generalization capability. Equation (16) shows the calculation of VIF:

$$VIF = \frac{\sum_{j \in \text{subbands}} I\left(\overline{C}^{N,j}; \overline{F}^{N,j} \mid S^{N,j} = s^{N,j}\right)}{\sum_{j \in \text{subbands}} I\left(\overline{C}^{N,j}; \overline{E}^{N,j} \mid S^{N,j} = s^{N,j}\right)},$$
(16)

where \(I(\overline{C}^{N}; \overline{F}^{N} \mid s^{N})\) and \(I(\overline{C}^{N}; \overline{E}^{N} \mid s^{N})\) represent the information that the human brain can extract from the training and the synthetic images, respectively.

Figure 12 compares our synthetic retinal images, which achieved the highest VIF relative to the training set, with existing studies that focused on synthesizing Messidor-1 images. According to the review study by [45] and our extensive search, only [10, 21] and [9] have explored retinal image synthesis on the Messidor-1 dataset, making them the natural baselines for the visual comparison shown in Fig. 12.

Fig. 12
figure 12

A comparison between our synthetic images, which achieved the highest VIF (Visual Information Fidelity), and other literature methods that were applied to the Messidor-1 dataset. The rows, from top to bottom, display the following: (A) a real fundus image, (B) our synthetic images, (C) synthetic images generated by [10], (D) synthetic images produced by [21], and (E) synthetic images created by [9]

Our synthetic images exhibit a distinct overall appearance in the second row compared to the real images in the first row. This indicates that our model did not simply memorize the training set but possesses a strong generalization capability to generate realistic-looking images. Our proposed method and the method introduced by [10] are the only end-to-end retinal image synthesis approaches in the literature. This means that the vessels' structure is initially synthesized in the first phase, followed by the generation of synthetic images based on the generated vessel structure in the second phase. In contrast, other studies [9, 21] synthesized their retinal images using an existing vessel tree instead of random sampling from a latent space.

The vessels in the synthetic images generated by [9, 21] appear sharper and more abundant than those generated by [10]. However, despite being produced by a segmentation model rather than synthesized, their vessels are only comparable to those in our synthetic images. Additionally, extreme tortuosity is observed in the images generated by [9], and missing optic discs are reported in some images generated by [9, 10]. In contrast, the characteristics of our synthetic images appear more realistic compared to [9, 10] and exhibit a closer resemblance to the real images.

We conducted further visual evaluation to verify the accuracy of anatomical features in our synthetic images, including the optic disc. Preserving the precise geometrical shape and clear boundaries of the optic disc is challenging in other studies [10, 15]. This is crucial in medical image generation, as accurate representation is necessary for proper diagnosis. Medical images have extreme variations in patterns, colors, and illumination that hinder the ability of unary GAN-based methods to generate complex image structures, according to [9]. However, our proposed framework breaks the synthesis task down into multiple generative models, each trained on a specific part of an image. Starting with the VAE models, each is weighted according to the significance of its task, as expressed in Eqs. (5) and (6). In the first stage, the VAEs are trained to generate unique segmentation geometry; in other words, they focus on the low-dimensional problem and ignore photorealism. In the second stage, the GAN is responsible for generating textures, lighting, and colors for the given geometry and then comparing the result with real images. Such a pipeline framework allows the model to converge faster and to perform better than a single GAN-based method in generating image geometries and textures. Figure 13 illustrates the optic disc generated by our model and other literary works.

Fig. 13
figure 13

Zoomed-in optic disc that is generated by our model as well as other literature works. The arrangement from top to bottom is as follows: (A) Optic disc images from the Messidor-1 dataset, (B) our synthetic disc, (C) synthetic disc produced by [10], (D) synthetic disc generated by [21], and (E) synthetic disc created by [9]

Although our artificial optic discs in Fig. 13 are not identical to the actual ones, they are closer in size and shape to real optic discs than those generated by [9, 10, 21]. Additionally, our model generates sharper disc boundaries, allowing a clear distinction between optic disc pixels and background pixels in our synthetic images, unlike the hazy and indistinguishable boundaries in other literature works [9, 10, 21].

5.2 Qualitative Image Evaluation

Assessing the quality of synthetic images objectively remains a challenge due to the lack of available references [20, 29, 49]. For the qualitative evaluation, we used SSIM. Three datasets containing a similar number of retinal images were used for a fair evaluation: (1) a Real dataset of unseen retinal images randomly taken from the Messidor-DB1 test set, (2) a Baseline dataset containing retinal images generated by the baseline method, and (3) the dataset generated by our proposed method.

To assess the similarity and variability of the synthesized images, we conducted multiple experimental comparisons. The first experiment involved estimating image variability within the same dataset. This was done by dividing each of the three datasets into two parts, performing self-comparison, and examining the standard deviation (std) for all images in the dataset. Each reported std value in Table 2 represents how much the SSIM values of all images deviate from their mean, allowing us to estimate the variability; higher std values indicate greater variability among images.
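The self-comparison experiment can be sketched as follows, assuming grayscale images normalized to [0, 1] and using the scikit-image SSIM implementation:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def ssim_stats(images_a, images_b):
    """Mean and standard deviation of pairwise SSIM between two image sets."""
    scores = [ssim(a, b, data_range=1.0) for a, b in zip(images_a, images_b)]
    return float(np.mean(scores)), float(np.std(scores))

# Self-comparison: split one dataset into two halves and compare them.
# mean_ssim, std_ssim = ssim_stats(dataset[:len(dataset) // 2], dataset[len(dataset) // 2:])
```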

Table 2 SSIM measures the similarity and variability of our synthetic images compared to baseline and real images

Table 2 presents the comparison results between two subsets of the same dataset, with the first three columns displaying these results. As a standard of comparison, we used the std value of the Real-Real column, since it was calculated from real images. The self-comparison of our synthetic images gave a std value closer to this gold standard than the self-comparison of the baseline method. Moreover, our synthesized images exhibited greater variability in content than those generated by the baseline method, primarily because of the SVV layer, which controls vessel abundance and increases the variability of the synthesized images.

On the other hand, if we compare the synthetic datasets with the actual dataset, not with each other, a lower std value indicates that the variance of the synthetic images is similar to that of the actual images (the gold standard). As shown in Table 2, our synthetic images have structural information that is much closer to the actual images since we reported a lower std value (0.0238) compared to the baseline method's std value (0.0373). The std value obtained from comparing our synthetic images to the actual images is almost the same as the std value obtained from comparing real-to-real images, with only a 0.03 margin, indicating that our images have nearly the same variance as the actual images.

The second experiment aimed to evaluate how similar the first half of the synthetic datasets was to both the second half and the real datasets. The mean SSIM value was used as a measure of similarity, with a higher value indicating greater similarity. Initially, each of the three datasets was self-compared to estimate the mean SSIM between the two sets within each dataset. In terms of the self-comparison test, the dataset generated by our model had a lower mean SSIM than the baseline method. This result indicates that our model had higher generalization capabilities than the baseline method since the images generated by our model were less similar to each other.

In contrast, our artificial images had a higher mean SSIM value of 0.8765 compared to the baseline method, which had a mean SSIM value of 0.8402, when compared to the actual dataset. This suggests that the content and overall layout of our synthetic images resemble real images more closely than those produced by the baseline method.

Our proposed model demonstrated superior performance by assigning each task to a specific generative model and weighting them, particularly when generating retinal structures that comprise the complete retinal image. This approach encourages the generative models to prioritize improving their outputs to closely resemble real images. Additionally, Fig. 14 provides a detailed summary of the comparison using a boxplot.

Fig. 14
figure 14

Shows SSIM measurements for real (R) and synthetic (S) datasets, with the orange line indicating the median values

Figure 15 displays the visual disparities between the distribution of our synthetic images and two distributions of real images randomly selected from the same dataset. Finally, we obtained a KL divergence score of 7.5199 when comparing our synthetic dataset with the first distribution of the real dataset, which is very close to the KL divergence obtained by comparing the two distributions of the real dataset, with a margin of only 0.9.

Fig. 15
figure 15

Pixel-intensity distributions of our synthetic images and the real images sets

5.3 Quantitative Image Evaluation

Researchers suggest assessing the usefulness of synthetic images for medical applications by evaluating their impact on the segmentation performance of a model [9, 15], or by training a classifier to distinguish between real and synthetic images instead of recruiting domain experts [50]. This paper used both segmentation and classification approaches to evaluate the reliability of the synthesized images. The segmentation task employed a state-of-the-art model from [51] to verify whether the synthetic images can train a model to segment vessel trees from retinal fundus images. The model was first trained on 20 real images from the training set of the publicly available DRIVE dataset [40] and then on 20 synthetic images randomly selected from our dataset, and both models were tested on the remaining 20 images from the DRIVE test set.

In Fig. 16, we compare the performance of a model trained on real images with a model trained on our synthetic images using ROC curves. The AUC score for the model trained on real images is 0.974, while the AUC score for the model trained on our synthetic images is 0.943. These results show promise, as the model trained on synthetic images performs similarly to state-of-the-art models trained on real images like [52,53,54,55].
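The ROC comparison can be reproduced with a sketch such as the following, which flattens the pixel-wise vessel probability maps and ground-truth masks before computing the curve and the AUC; the same computation is run for the real-trained and the synthetic-trained models:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def vessel_roc(prob_maps, gt_masks):
    """ROC curve and AUC from pixel-wise vessel probabilities and binary ground truth."""
    y_score = np.concatenate([p.ravel() for p in prob_maps])
    y_true = np.concatenate([g.ravel().astype(int) for g in gt_masks])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, roc_auc_score(y_true, y_score)
```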

Fig. 16
figure 16

ROC curves for models trained on real (blue) and synthetic images (red)

In the classification task, images from the real Messidor-1 dataset and the synthetic dataset were randomly selected and labelled as 1 (real) or 0 (fake). The images were then mixed, shuffled, and split into 80% training and 20% testing sets. The trained classifier had difficulty distinguishing real from synthetic images during testing, with an accuracy of only 0.6216. Table 3 displays the reduced per-class scores in terms of precision, F1-score, and recall.
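The setup for this experiment can be sketched as follows; the classifier itself is not specified in the text and is therefore left abstract:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_real_fake_split(real_images, synth_images, seed=0):
    X = np.concatenate([real_images, synth_images])
    y = np.concatenate([np.ones(len(real_images)),     # 1 = real
                        np.zeros(len(synth_images))])   # 0 = fake
    return train_test_split(X, y, test_size=0.2, shuffle=True,
                            stratify=y, random_state=seed)
```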

Table 3 Classifier's Performance in Precision, F1-Score, and Recall Metrics to Distinguish Real and Synthetic Images

The classifier's performance is summarized in Fig. 17, which shows the Precision-Recall curve (left) and the confusion matrix (right).

Fig. 17
figure 17

The Precision-Recall plot (left) and Confusion matrix (right) display the classifier's performance in distinguishing real and synthetic images

6 Model's Limitations

Although the proposed model can generate realistic images with consistent optic disc geometry, it has some limitations. First, the size of the generated images is not the same as that of the input images: limited hardware resources, together with the large size of the proposed architecture, forced us to use a \((256\times 256)\) image size instead of \((512\times 512)\); as a result, the generated images lack high resolution. Furthermore, the reliability of the generated images still requires further validation. Although multiple quality assessments were performed, ophthalmologists must be involved in the appraisal of synthetic images; as recommended by [56], clinical assessment is necessary to verify the reliability of the images. Lastly, the SVV layer's threshold may affect the generation of the vessel structure by producing either too many or too few vessels, which may be unacceptable to experts. Also, giving the same generation priority to veins and arteries makes it difficult to distinguish between them. Therefore, further investigation is needed in future research.

7 Conclusion

In this study, multiple VAEs and GANs were trained on Messidor-DB1, along with the proposed SVV layer to sharpen and vary the vessel structure morphology. Unlike other generative models, the proposed model does not require vessel masks to synthesize images; instead, it samples from a predefined Gaussian distribution to generate unlimited images. We also followed a new training strategy that uses 70% of the batch to train G and the remaining 30% to train D. The method produced more realistic image texture, sharper optic disc boundaries, and controlled overall vessel morphology. Qualitative and quantitative results showed that our synthesized images can train a segmentation model comparably to real images, which is promising for fulfilling the increased demand for annotated data in medical applications.

In the future, we aim to reduce the large size of the proposed model by incorporating transfer learning, such as a pre-trained VGG16 architecture, instead of training the VAE models from scratch, in order to minimize the number of trainable parameters and reduce the computational complexity. Furthermore, exploiting the wide availability of unlabeled image data during training may further improve the quality of the generated images.

Funding Statement

This work was funded by the Ministry of Higher Education, Malaysia, via research grant FRGS-1-2019-ICT02-UKM-02-9. Ethical approval was obtained from the UKM medical hospital under reference UKM PPI/111/8/JEP-2021-718, dated 1 November 2021.